Abstract

Recommendation systems are information-filtering systems that tailor information to users on the basis of knowledge about 
their preferences. The ability of these systems to profile users is what enables such intelligent functionality, but at the same time, 
it is the source of serious privacy concerns. In this paper we investigate a privacy-enhancing technology that aims at hindering 
an attacker in its efforts to accurately profile users based on the items they rate. Our approach capitalizes on the combination of 
two perturbative mechanisms — the forgery and the suppression of ratings. While this technique enhances user privacy to a certain 
extent, it inevitably comes at the cost of a loss in data utility, namely a degradation of the recommendation's accuracy. In short, 
it poses a trade-off between privacy and utility. 

The theoretical analysis of said trade-off is the object of this work. We measure privacy as the Kullback-Leibler divergence 
between the user's and the population's item distributions, and quantify utility as the proportion of ratings users consent to forge 
and eliminate. Equipped with these quantitative measures, we find a closed-form solution to the problem of optimal forgery and 
suppression of ratings, and characterize the optimal trade-off surface among privacy, forgery rate and suppression rate. Experimental 
results on a popular recommendation system show how our approach may contribute to privacy enhancement. 

Index Terms 

Information privacy, Kullback-Leibler divergence, user profiling, privacy-enhancing technologies, data perturbation,
recommendation systems.

I. Introduction 

From the advent of the Internet and the World Wide Web, the amount of information available to users has grown exponentially. 
As a result, the ability to find information relevant for their interests has become a central issue in recent years. In this context 
of information overload, recommendation systems arise to provide information tailored to users on the basis of knowledge 
about their preferences [2]. In essence, a recommendation system may be regarded as a type of information-filtering system
that suggests information items users may be interested in. Examples of such systems include recommending music at Last.fm
and Pandora Radio, movies by MovieLens and Netflix, videos at YouTube, news at Digg and Google News, and books and 
other products at Amazon. 

Most of these systems capitalize on the creation of profiles that represent interests and preferences of users. Such profiles 
are the result of the collection and analysis of the data that users communicate to those systems. A distinction is frequently 
made between explicit and implicit forms of data collection. The most popular form of explicit data collection is that users 
communicate their preferences by rating items. This is the case of many of the applications mentioned above, where users 
assign ratings to songs, movies or news they have already listened to, watched or read. Other strategies to capture users' interests
include asking them to sort a number of items by order of predilection, or suggesting that they mark the items they like. 
On the other hand, recommendation systems may collect data from users without requiring them to explicitly convey their 
preferences [3]. These practices comprise observing the items clicked by users in an online store, analyzing the time it takes
users to examine an item, or simply keeping a record of the purchased items. 

The prolonged collection of these personal data allows the system to extract an accurate snapshot of user interests, i.e., 
their profiles. With this invaluable source of information, the recommendation system applies some technique [4] to generate a
prediction of users' interests for those items they have not yet considered. For example, Movielens and Digg use collaborative- 
filtering techniques to predict the rating that a user would give to a movie and to create a personalized list of recommended 
news, respectively. In a nutshell, the ability of profiling users based on such personal information is precisely what enables 
the intelligent functionality of those systems. 

Despite the many advantages recommendation systems are bringing to users, the information collected, processed and stored 
by these systems prompts serious privacy concerns. One of the main privacy risks perceived by users is that of a computer 
"figuring things out" about them [5]. Many users are worried about the idea that their profiles may reveal sensitive information 
such as health-related issues, political preferences, salary or religion. Such privacy risk is exacerbated especially when these 
profiles are combined across several information services or enriched with data from social networks. An illustrative example
is [6], which demonstrates that it is possible to unveil sensitive information about a person from their movie rating history
by cross-referencing data from other sources. The authors analyzed the Netflix Prize data set [7], which contained anonymous
movie ratings of around half a million users of Netflix, and were able to uncover the identity, political leaning and even sexual
orientation of some of those users, by simply correlating their ratings with reviews they posted on the popular movie Web site
IMDb. Apart from the risk of cross-referencing, users are also concerned that the system's predictions may be totally erroneous
and be later used to defame them. This latter situation is examined in [8], where the accuracy of the predictions provided by
the TiVo digital video recorder and Amazon is questioned. Lastly, other privacy risks include unsolicited marketing, information
leaked to other users of the same computer, court subpoenas, and government surveillance [5].

Some parts of this paper (a reduced version of Secs. I and II) were presented at the International Workshop on Data Privacy Management, Leuven, Belgium,
Sep. 2011 [1]. The formulation of the trade-off between privacy and utility (Sec. III), the theoretical analysis (Sec. IV), the experiments (Sec. V) and the
conclusions (Sec. VI) are all new work.

Fig. 1: The profile of a user is modeled in Movielens as a histogram of absolute frequencies of ratings within a set of movie genres (bottom). Based on this
profile, the recommender predicts the rating that the user would probably give to a movie (top). After having watched the movie, the user rates it and their
profile is updated.

As a result of all this, it is not surprising that some users are reluctant to reveal their interests. In fact, [9] reports that 24%
of Internet users surveyed provided false information in order to avoid giving private information to a Web site. Similarly,
another study [10] finds that 95% of the respondents refused, at some point, to provide personal information when requested by
a Web site. In closing, these studies seem to indicate that submitting false information and refusing to give private information 
are strategies accepted by users concerned with their privacy. 



A. Contribution and Plan of this Paper 

In this paper we approach the problem of protecting user privacy in those recommendation systems that profile users on the 
basis of the items they rate. Given the willingness of users to provide fake information and elude disclosing private data, 
we investigate a privacy-enhancing technology (PET) that combines these two forms of data perturbation, namely the forgery 
and the suppression of ratings. Concordantly, in our scenario users rate those items they have an opinion on. However, in 
order to avoid being accurately profiled by the recommender or, in general, by any privacy attacker capable of collecting 
this information, users may wish to refrain from rating some of those items and/or rate items that do not reflect their actual 
preferences. Our approach thus protects user privacy to a certain degree, without having to trust the recommendation system 
or the network operator, but at the cost of a loss in utility, namely a degradation of the quality of the recommendation. In other words,
our PET poses a trade-off between privacy and utility. 

The theoretical analysis of the trade-off between these two contrasting aspects is the object of this work. We tackle the issue 
in a systematic fashion, drawing upon the methodology of multiobjective optimization. Before proceeding, though, we adopt a 
quantifiable measure of user privacy — the Kullback-Leibler (KL) divergence between the probability distribution of the user's 
items and the population's distribution, a criterion that we introduced in previous work [11] and justified and interpreted in [12],
[13] by leveraging the rationale behind entropy-maximization methods. Equipped with a measure of both privacy and utility,
we formulate an optimization problem modeling the trade-off between privacy on the one hand, and on the other forgery rate 
and suppression rate as utility metrics. Our extensive theoretical analysis finds a closed-form solution to the problem of optimal 
forgery and suppression of ratings, and characterizes the optimal trade-off between the aspects of privacy and utility. 

In addition, we provide an empirical evaluation of our data-perturbative approach. Specifically, we apply the forgery and 
the suppression of ratings in the popular movie recommendation system Movielens, and show how these two strategies may 
preserve the privacy of its users. 

Sec. II reviews several data-perturbative approaches aimed at enhancing user privacy in the context of recommender systems.
Sec. III introduces our privacy-enhancing technology, proposes a quantitative measure of the privacy of user profiles, and
formulates the trade-off between privacy and utility. Sec. IV presents a theoretical analysis of the optimization problem
characterizing the privacy-forgery-suppression trade-off. In this same section we also provide a numerical example that illustrates
our formulation and theoretical results. Sec. V evaluates our privacy-protecting mechanism in a real recommendation system.
Finally, conclusions are drawn in Sec. VI.

II. State of the Art 

Numerous approaches have been proposed to protect user privacy in the context of recommendation systems. These approaches 
fundamentally suggest either perturbing the information provided by users or using cryptographic techniques. 

In the case of perturbative methods for recommendation systems, [14] proposes that users add random values to their ratings 
and then submit these perturbed ratings to the recommender. After receiving these ratings, the system executes an algorithm 
and sends the users some information that allows them to compute the prediction. When the number of participating users is 
sufficiently large, the authors find that user privacy is protected to a certain extent and the system reaches a decent level of 
accuracy. However, even though a user disguises all their ratings, it is evident that the items themselves may uncover sensitive 
information. Simply put, the mere fact of showing interest in a certain item may be more revealing than the rating assigned 
to that item. For instance, a user rating a book called "How to Overcome Depression" indicates a clear interest in depression, 
regardless of the score assigned to this book. Apart from this critique, other works [15], [16] stress that the use of randomized
data distortion techniques might not be able to preserve privacy. 

In line with this work, [17] applies the same data-perturbative technique to collaborative-filtering algorithms based on
singular-value decomposition. Specifically, the authors focus on the impact that their technique has on privacy. For this purpose, they
use the privacy metric proposed by [18], which is essentially equivalent to differential entropy, and conduct some experiments
with data sets from Movielens and Jester. The results show the trade-off curve between accuracy in recommendations and 
privacy. In particular, they measure accuracy as the mean absolute error between the predicted values from the original ratings 
and the predictions obtained from the perturbed ratings. 

At this point, we would like to remark that the use of perturbative techniques is by no means new in other scenarios 
such as private information retrieval and the semantic Web. In the former scenario, users send general-purpose queries to an 
information service provider. A perturbative approach to protect user profiles in this context consists in combining genuine with 
false queries. Specifically, [11] proposes a nonrandomized method for query forgery and investigates the trade-off between privacy
and the additional traffic overhead. In the semantic Web scenario, users annotate resources with the purpose of classifying 
them. In this application domain, the perturbation of user profiles for privacy preservation may be carried out by dropping 
certain annotations or tags. An example of this kind of perturbation may be found in [19]-[21], where the authors propose the
elimination of tags as a privacy-enhancing strategy. 

Regarding the use of cryptographic techniques, [22], [23] propose a method that enables a community of users to calculate
a public aggregate of their profiles without revealing them on an individual basis. In particular, the authors use a homomorphic 
encryption scheme and a peer-to-peer communication protocol for the recommender to perform this calculation. Once the 
aggregated profile is computed, the system sends it to users, who finally use local computation to obtain personalized 
recommendations. This proposal prevents the system or any external attacker from ascertaining the individual user profiles. 
However, its main handicap is assuming that an acceptable number of users is online and willing to participate in the protocol. In
line with this, [24] uses a variant of Paillier's homomorphic cryptosystem which improves the efficiency of the communication
protocol. Another solution [25] presents an algorithm aimed at providing more efficiency by using the scalar product protocol.

III. Privacy Protection via Forgery and Suppression of Ratings 

In this section, first we present the forgery and the suppression of ratings as a privacy-enhancing technology. The description 
of our approach is prefaced by a brief introduction of the concepts of soft privacy and hard privacy. Secondly, we propose a 
model of user profile and set forth our assumptions about the adversary capabilities. Finally, we provide a quantitative measure 
of both privacy and utility, and present a formulation of the trade-off between these two contrasting aspects. 

A. Soft Privacy vs. Hard Privacy 

The privacy research literature [26] recognizes the distinction between the concepts of soft privacy and hard privacy. A
privacy-enhancing mechanism providing soft privacy assumes that users entrust their private data to an entity, which is
thereafter responsible for the protection of their data. In the literature, numerous attempts to protect privacy have followed the
traditional method of anonymous communications [27]-[30], which is fundamentally based on the suppositions of soft privacy.
Unfortunately, anonymous-communication systems are not completely effective [31]-[34], they normally come at the cost of
infrastructure, and assume that users are willing to trust other parties.

Our privacy-protecting technique, in contrast, leverages the principle of hard privacy, which assumes that users mistrust
communicating entities and therefore strive to reveal as little private information as possible. In the motivating scenario of 
this work, hard privacy means that users need not trust an external entity such as the recommender or the network operator. 
Consequently, because users just trust themselves, it is their own responsibility to protect their privacy. In this state of affairs, 
the forgery and the suppression of ratings appear as a technique that may hinder privacy attackers in their efforts to accurately 
profile users on the basis of the items they rate. Specifically, when users adhere to this technique, they have the possibility to
submit ratings to items that do not reflect their genuine preferences, and/or refrain from rating some items of their interest; this
is what we refer to as the forgery and the suppression of ratings, respectively.



B. User Profile and Adversary Model 

In the scenario of recommendation systems, users rate items of a very different nature, e.g., music, pictures, videos or news, 
according to their personal preferences. The information conveyed allows those systems to extract a profile of interests or user 
profile, which turns out to be essential in the provision of personalized recommendations.

We mentioned in Sec. I that Movielens represents user profiles by using some kind of histogram. Other systems such as Jinni
and Last.fm show this information by means of a tag cloud, which in essence may be regarded as another kind of histogram. In
this same spirit, recent privacy-protecting approaches in the scenario of recommendation systems also propose using histograms
of absolute frequencies for modeling user profiles [35], [36].

According to these examples and inspired by other works in the field [1], [11], [19]-[21], [37], we model the items rated
by users as random variables (r.v.'s) taking on values in a common finite alphabet of categories, namely the set $\{1, \ldots, n\}$ for
some integer $n \geq 2$. Concordantly, we model the profile of a user as a probability mass function (PMF) $q = (q_1, \ldots, q_n)$, that
is, a histogram of relative frequencies of items within a predefined set of categories of interest.
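
The profile model above can be illustrated with a minimal Python sketch. The function name `user_profile`, the input layout (one list of category indices per rated item) and the sample data are assumptions made only for this illustration; in particular, letting an item count once for each category it belongs to is one possible convention, not necessarily the one used by any specific recommender.

```python
from collections import Counter

def user_profile(rated_item_categories, n):
    """Return the relative-frequency histogram q over categories 1..n.

    rated_item_categories: one list of category indices per rated item
    (assumption: an item contributes one count to each of its categories).
    """
    counts = Counter(c for cats in rated_item_categories for c in cats)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one categorized rating is required")
    return [counts.get(i, 0) / total for i in range(1, n + 1)]

# Hypothetical data: three rated movies over n = 4 genres.
q = user_profile([[1, 2], [2], [4]], n=4)
print(q)  # [0.25, 0.5, 0.0, 0.25]
```

Note that the scores themselves never enter the computation; only the categories of the rated items do, in line with the model described in the text.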

We would like to emphasize that, under this model, user profiles do not capture the particular scores given to items, but 
what we consider to be more sensitive: the categories these items belong to. This is exactly the case of Movielens and 
numerous content-based recommendation systems. Fig. 1 provides an example that illustrates how user profiles are constructed
in Movielens. In this particular example, a user assigns two stars to a movie, meaning that they consider it to be "fairly bad". 
However, the recommender updates their profile based only on the categories this movie belongs to. 

According to this model, a privacy attacker supposedly observes a perturbed version of this profile, resulting from the forgery 
and the suppression of certain ratings, and is unaware or ignores the fact that the observed user profile, also in the form of 
a histogram, does not reflect the actual profile of interests of the user in question. In principle, our passive attacker could be 
the recommender itself or the network operator. However, the set of potential attackers is not restricted merely to these two 
entities. Since ratings are often publicly available to other users of the recommendation system, any other attacker able to 
crawl through this information is taken into consideration in our adversary model. 

When users adhere to the forgery and the suppression of ratings, they specify a forgery rate $\rho \in [0, \infty)$ and a suppression
rate $\sigma \in [0, 1)$. The former is the ratio of forged ratings to total genuine ratings that a user consents to submit. The latter
is the fraction of genuine ratings that the user agrees to eliminate$^{(a)}$. Note that, in our approach, the number of false ratings
submitted by the user can exceed the number of genuine ratings, that is, $\rho$ can be greater than 1. Nevertheless, the number of
suppressed ratings is always lower than the number of genuine ratings.

By forging and suppressing ratings, the actual profile of interests $q$ is then perceived from the outside as the apparent
PMF $t = \frac{q + r - s}{1 + \rho - \sigma}$, according to a forgery strategy $r = (r_1, \ldots, r_n)$ and a suppression strategy $s = (s_1, \ldots, s_n)$. Such strategies
represent the proportion of ratings that the user should forge and eliminate in each of the $n$ categories. Naturally, these strategies
must satisfy, on the one hand, that $r_i \geq 0$, $s_i \geq 0$ and $q_i + r_i - s_i \geq 0$ for $i = 1, \ldots, n$, and on the other, that $\sum_{i=1}^{n} r_i = \rho$
and $\sum_{i=1}^{n} s_i = \sigma$. In conclusion, the apparent profile is the result of the addition and the subtraction of certain items to/from
the actual profile, and the posterior normalization by $\frac{1}{1 + \rho - \sigma}$ so that $\sum_{i=1}^{n} t_i = 1$.
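
As an illustration of the definition of the apparent profile, the following Python sketch checks the feasibility constraints on a forgery and a suppression strategy and then normalizes. The function name `apparent_profile`, the tolerance parameter and the sample vectors are hypothetical and serve only this example.

```python
def apparent_profile(q, r, s, tol=1e-12):
    """Return t = (q + r - s) / (1 + rho - sigma) after checking feasibility."""
    if not (len(q) == len(r) == len(s)):
        raise ValueError("q, r and s must have the same length")
    rho, sigma = sum(r), sum(s)  # forgery and suppression rates
    if not 0 <= sigma < 1:
        raise ValueError("suppression rate must lie in [0, 1)")
    for qi, ri, si in zip(q, r, s):
        if ri < 0 or si < 0 or qi + ri - si < -tol:
            raise ValueError("infeasible forgery/suppression strategy")
    norm = 1 + rho - sigma
    return [max(qi + ri - si, 0.0) / norm for qi, ri, si in zip(q, r, s)]

q = [0.5, 0.3, 0.2]   # hypothetical genuine profile
r = [0.0, 0.0, 0.2]   # forge ratings in category 3 (rho = 0.2)
s = [0.1, 0.0, 0.0]   # suppress ratings in category 1 (sigma = 0.1)
t = apparent_profile(q, r, s)
```

Here the unnormalized vector is (0.4, 0.3, 0.4), which sums to 1 + 0.2 - 0.1 = 1.1, so dividing by 1.1 yields a valid PMF.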



C. Measuring the Privacy of User Profiles 

Inspired by the privacy measures proposed in [11]-[13], [19], [38] and according to the model of user profile assumed in
Sec. III-B, we define initial privacy risk as the KL divergence [39] between the user's genuine profile and the population's
distribution, that is,

$$\mathcal{R}_0 = \mathrm{D}(q \,\|\, p).$$

Similarly, we define (final) privacy risk $\mathcal{R}$ as the KL divergence between the user's apparent profile and the population's
distribution,

$$\mathcal{R} = \mathrm{D}(t \,\|\, p) = \mathrm{D}\left(\frac{q + r - s}{1 + \rho - \sigma} \,\middle\|\, p\right).$$

An intuitive justification of our privacy metric stems from the observation that, whenever the user's apparent item distribution 
diverges too much from the population's, a privacy attacker will have actually gained some information about the user, in contrast 
to the statistics of the general population. 
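
To make the metric concrete, the sketch below computes the KL divergence of a genuine and of an apparent profile with respect to a population profile. The logarithm base (bits), the function name `kl_divergence` and all three distributions are assumptions chosen for illustration; they are not taken from the experiments of this paper.

```python
import math

def kl_divergence(t, p):
    """D(t || p) in bits, with the convention 0 * log(0 / p_i) = 0."""
    total = 0.0
    for ti, pi in zip(t, p):
        if ti > 0:
            if pi <= 0:
                return math.inf  # t puts mass where p has none
            total += ti * math.log2(ti / pi)
    return total

p = [0.4, 0.4, 0.2]      # population profile (hypothetical)
q = [0.7, 0.2, 0.1]      # genuine profile (hypothetical)
t = [0.5, 0.35, 0.15]    # some apparent profile closer to p (hypothetical)

initial_risk = kl_divergence(q, p)
final_risk = kl_divergence(t, p)
print(initial_risk > final_risk)  # True
```

In this toy case the apparent profile lies closer to the population's distribution than the genuine one, so the privacy risk as defined above decreases.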

A richer argument may be found in [12], [13], where we establish some connections between Jaynes' rationale on
entropy-maximization methods and the use of entropies and divergences as measures of privacy. The leading idea is that the
method of types from information theory establishes an approximate monotonic relationship between the likelihood of a PMF
in a stochastic system and its Shannon's entropy. Loosely speaking and in our context, the higher the entropy of a profile,
the more likely it is, and the more users behave similarly. This is in absence of a probability distribution model for the PMFs,
viewed abstractly as r.v.'s themselves. Under this interpretation, Shannon's entropy is a measure of anonymity, not in the sense
that the user's identity remains unknown, but only in the sense that higher likelihood of an apparent profile, believed by an
external observer to be the actual profile, makes that profile more common, helping the user go unnoticed and making them less
interesting to an attacker assumed to target peculiar users.

(a) The description of an architecture implementing this data-perturbative approach may be found in

If an aggregated histogram of the population were available as a reference profile, as we assume in this work, the extension 
of Jaynes' argument to relative entropy also gives an acceptable measure of privacy (or anonymity). Recall [39] that KL
divergence is a measure of discrepancy between probability distributions, which reduces to Shannon's entropy, up to a sign and an
additive constant, when the reference distribution is uniform. Conceptually, a lower KL divergence hides discrepancies with respect to a
reference profile, say the population's, and there also exists a monotonic relationship between the likelihood of a distribution 
and its divergence with respect to the reference distribution of choice, which enables us to regard KL divergence as a measure 
of anonymity in a sense entirely analogous to the above mentioned. 
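
The special case of a uniform reference distribution can be made explicit with a short derivation, sketched here under the assumption that $u$ denotes the uniform distribution on $\{1, \ldots, n\}$ and $\mathrm{H}(q)$ denotes Shannon's entropy of $q$ (both symbols are introduced only for this illustration).

```latex
\begin{align*}
\mathrm{D}(q \,\|\, u)
  &= \sum_{i=1}^{n} q_i \log \frac{q_i}{1/n} \\
  &= \log n + \sum_{i=1}^{n} q_i \log q_i \\
  &= \log n - \mathrm{H}(q).
\end{align*}
```

Hence, for a uniform reference, minimizing the divergence is equivalent to maximizing the entropy of the profile.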



D. Formulation of the Trade-Off among Privacy, Forgery and Suppression

Our data-perturbative mechanism allows users to enhance their privacy to a certain extent, since the resulting profile, as observed 
from the outside, no longer captures their actual interests. The price to be paid, however, is a loss in data utility, in particular 
in the accuracy of the recommender's predictions. 

For the sake of tractability, in this work we consider as utility metrics the forgery rate and the suppression rate. This 
consideration enables us to formulate the problem of choosing a forgery strategy and a suppression strategy as a multiobjective 
optimization problem that takes into account privacy, forgery rate and suppression rate. Specifically, under the assumption that 
the population of users is large enough to neglect the impact of the choice of r and s on p, we define the privacy-forgery- 
suppression function 

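$$\mathcal{R}(\rho, \sigma) = \min_{\substack{r,\, s \\ r_i \geq 0,\ s_i \geq 0, \\ q_i + r_i - s_i \geq 0, \\ \sum_i r_i = \rho,\ \sum_i s_i = \sigma}} \mathrm{D}\left(\frac{q + r - s}{1 + \rho - \sigma} \,\middle\|\, p\right), \tag{1}$$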

which characterizes the optimal trade-off among privacy, forgery rate and suppression rate. 

Conceptually, the result of this optimization is a pair of strategies $r$ and $s$ that contain information about which ratings should be
forged and which ones should be suppressed, in order to achieve the minimum privacy risk. More precisely, the component $r_i$
is the percentage of items that the user should forge in the category $i$. The component $s_i$ is defined analogously for suppression.
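
Any feasible pair of strategies yields an upper bound on the privacy-forgery-suppression function. The Python sketch below evaluates one such pair, which forges in proportion to the population profile and suppresses in proportion to the genuine profile. This proportional strategy is an assumption introduced purely for illustration and is generally not the optimal solution derived in this paper; the helper names and the numerical values are likewise hypothetical.

```python
import math

def kl(t, p):
    """KL divergence D(t || p) in nats, skipping zero-mass terms of t."""
    return sum(ti * math.log(ti / pi) for ti, pi in zip(t, p) if ti > 0)

def apparent(q, r, s):
    """Apparent profile t = (q + r - s) / (1 + rho - sigma)."""
    norm = 1 + sum(r) - sum(s)
    return [(qi + ri - si) / norm for qi, ri, si in zip(q, r, s)]

def proportional_strategy(q, p, rho, sigma):
    """Feasible but generally suboptimal: forge along p, suppress along q."""
    r = [rho * pi for pi in p]      # r_i >= 0 and sum(r) = rho
    s = [sigma * qi for qi in q]    # 0 <= s_i <= q_i and sum(s) = sigma
    return r, s

q = [0.7, 0.2, 0.1]   # hypothetical genuine profile
p = [0.4, 0.4, 0.2]   # hypothetical population profile
rho, sigma = 0.3, 0.2

r, s = proportional_strategy(q, p, rho, sigma)
risk = kl(apparent(q, r, s), p)

# The apparent profile is a convex combination of q and p, so convexity of
# the KL divergence in its first argument gives the bound below.
bound = (1 - sigma) / (1 + rho - sigma) * kl(q, p)
print(risk <= bound)  # True
```

Since the strategy is feasible, the value `risk` upper-bounds the minimum in the definition of the privacy-forgery-suppression function at these rates; the optimal strategies can only do better.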
IV. Optimal Forgery and Suppression of Ratings

This section is entirely devoted to the theoretical analysis of the privacy-forgery-suppression function (1) defined in Sec. III-D.



In our attempt to characterize the trade-off among privacy risk, forgery rate and suppression rate, we shall present a closed- 
form solution to the optimization problem inherent in the definition of this function. Afterwards, we shall analyze some 
fundamental properties of said trade-off. For the sake of brevity, our theoretical analysis only contemplates the case when all 
given probabilities are strictly positive: 

$$q_i, p_i > 0 \quad \text{for all } i = 1, \ldots, n. \tag{2}$$

Additionally, we suppose without loss of generality that

$$\frac{q_1}{p_1} \leq \cdots \leq \frac{q_n}{p_n}. \tag{3}$$

Before diving into the mathematical analysis, it is immediate from the definition of the privacy-forgery-suppression function
that its initial value is $\mathcal{R}(0, 0) = \mathrm{D}(q \,\|\, p)$. The characterization of the optimal trade-off surface modeled by $\mathcal{R}(\rho, \sigma)$ at any
other values of $\rho$ and $\sigma$ is the focus of this section.



A. Closed-Form Solution 

Our first theorem, Theorem 3, will present a closed-form solution to the minimization problem involved in the definition of
function (1). The solution will be derived from Lemma 1, which addresses a resource allocation problem. This is a theoretical
problem encountered in many fields, from load distribution and production planning to communication networks, computer
scheduling and portfolio selection [40]. Although this lemma provides a parametric-form solution, we shall be able to proceed
towards an explicit closed-form solution, albeit piecewise.

Lemma 1 (Resource Allocation): For all $k = 1, \ldots, n$, let $f_k$ be a real-valued function on $\{(x_k, y_k) \in \mathbb{R}^2 : \kappa_k + x_k - y_k \geq 0\}$,
twice differentiable in the interior of its domain. Assume that $\frac{\partial f_k}{\partial x_k} = -\frac{\partial f_k}{\partial y_k}$, that $\frac{\partial^2 f_k}{\partial x_k^2} = \frac{\partial^2 f_k}{\partial y_k^2} > 0$, and that the Hessian $H(f_k)$
is positive semidefinite. Define $h_k = \frac{\partial f_k}{\partial x_k}$. Because $\frac{\partial h_k}{\partial x_k} > 0$ and $\frac{\partial h_k}{\partial y_k} < 0$, it follows that $h_k$ is strictly increasing in $x_k$ and
strictly decreasing in $y_k$. Consequently, for a fixed $y_k$, $h_k(x_k, y_k)$ is an invertible function of $x_k$. Denote by $h_k^{-1}$ the inverse
of $h_k(x_k, 0)$. Suppose further that $h_k(x_k, y_k) = h_k(x_k - y_k, 0)$ and finally that $\lim_{x_k - y_k \to -\kappa_k} h_k(x_k, y_k) = -\infty$. Now consider the
following optimization problem in the variables $x_1, \ldots, x_n$ and $y_1, \ldots, y_n$:

$$\begin{aligned}
\text{minimize} \quad & \sum_{k=1}^{n} f_k(x_k, y_k) \\
\text{subject to} \quad & x_k, y_k \geq 0, \\
& \kappa_k + x_k - y_k \geq 0 \quad \text{for } k = 1, \ldots, n, \\
\text{and} \quad & \sum_{k=1}^{n} x_k = \eta, \quad \sum_{k=1}^{n} y_k = \theta \quad \text{for some } \eta, \theta \geq 0.
\end{aligned}$$

(i) The solution to the problem depends on two real numbers $\psi, \omega$ that satisfy the equality constraints $\sum_k x_k^* = \eta$
and $\sum_k y_k^* = \theta$. The solution exists provided that $\psi \leq \omega$. If $\psi < \omega$, then the solution is unique and yields

$$(x_k^*, y_k^*) = \left(\max\{0, h_k^{-1}(\psi)\},\ \max\{0, -h_k^{-1}(\omega)\}\right).$$

If $\psi = \omega$, then there exists an infinite number of solutions of the form $(x_k^* + \alpha_k, y_k^* + \alpha_k)$ for all $\alpha_k \in \mathbb{R}_+$ meeting the
two aforementioned equality constraints.

Without loss of generality, suppose that $h_1(0, 0) \leq \cdots \leq h_n(0, 0)$.

(ii) For $\psi < \omega$, consider the following cases:

(a) $h_i(0, 0) < \psi \leq h_{i+1}(0, 0)$ for some $i = 1, \ldots, j - 1$ and $h_{j-1}(0, 0) \leq \omega < h_j(0, 0)$ for some $j = 2, \ldots, n$.

(b) $h_{j-1}(0, 0) \leq \omega$ for $j = n + 1$ and, either $h_i(0, 0) < \psi \leq h_{i+1}(0, 0)$ for some $i = 1, \ldots, n - 1$ or $h_i(0, 0) < \psi$ for
$i = n$.

(c) $\psi \leq h_{i+1}(0, 0)$ for $i = 0$ and, either $h_{j-1}(0, 0) \leq \omega < h_j(0, 0)$ for some $j = 2, \ldots, n$ or $\omega < h_j(0, 0)$ for $j = 1$.

(d) $h_{j-1}(0, 0) \leq \omega$ for $j = n + 1$ and $\psi \leq h_{i+1}(0, 0)$ for $i = 0$.

In each case, and for the corresponding indexes $i$ and $j$,

$$x_k^* = \begin{cases} h_k^{-1}(\psi), & k = 1, \ldots, i \\ 0, & k = i + 1, \ldots, n \end{cases}, \qquad
y_k^* = \begin{cases} 0, & k = 1, \ldots, j - 1 \\ -h_k^{-1}(\omega), & k = j, \ldots, n \end{cases}.$$

(iii) For $\psi = \omega$, consider the following cases:

(a) either $h_i(0, 0) < \psi < h_j(0, 0)$ for some $j = 2, \ldots, n$ and $i = j - 1$, or $h_i(0, 0) < \psi = h_{i+1}(0, 0) = \cdots =
h_{j-1}(0, 0) < h_j(0, 0)$ for some $i = 1, \ldots, j - 2$ and some $j = 3, \ldots, n$.

(b) for $j = n + 1$, either $h_i(0, 0) < h_{i+1}(0, 0) = \cdots = h_{j-1}(0, 0) = \omega$ for some $i = 1, \ldots, j - 2$ or $h_i(0, 0) < \omega$
with $i = n$.

(c) for $i = 0$, either $\psi = h_{i+1}(0, 0) = \cdots = h_{j-1}(0, 0) < h_j(0, 0)$ for some $j = 2, \ldots, n$ or $\psi < h_{i+1}(0, 0)$ with
$j = 1$.

In each case, and for the corresponding indexes $i$ and $j$,

$$x_k^* = \begin{cases} h_k^{-1}(\psi) + \alpha_k, & k = 1, \ldots, i \\ \alpha_k, & k = i + 1, \ldots, n \end{cases}, \qquad
y_k^* = \begin{cases} \alpha_k, & k = 1, \ldots, j - 1 \\ -h_k^{-1}(\omega) + \alpha_k, & k = j, \ldots, n \end{cases}.$$
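
The parametric solution in part (i) can be explored numerically. The Python sketch below assumes, purely for illustration, the functions $f_k(x_k, y_k) = (\kappa_k + x_k - y_k) \ln \frac{\kappa_k + x_k - y_k}{p_k}$, for which $h_k(x_k, y_k) = \ln \frac{\kappa_k + x_k - y_k}{p_k} + 1$ and $h_k^{-1}(v) = p_k e^{v - 1} - \kappa_k$. This choice satisfies the hypotheses of the lemma but is our own; the bisection routine, the search interval and the sample data are also assumptions, and the sketch only covers the case $\psi < \omega$ with $\eta, \theta > 0$.

```python
import math

def solve_allocation(kappa, p, eta, theta, iters=200):
    """Find psi, omega and the allocation (x, y) of part (i) for psi < omega.

    Assumes f_k(x, y) = (kappa_k + x - y) * ln((kappa_k + x - y) / p_k),
    so that h_k^{-1}(v) = p_k * exp(v - 1) - kappa_k, and eta, theta > 0.
    """
    n = len(p)

    def h_inv(k, v):
        return p[k] * math.exp(v - 1.0) - kappa[k]

    def x_of(psi):
        return [max(0.0, h_inv(k, psi)) for k in range(n)]

    def y_of(omega):
        return [max(0.0, -h_inv(k, omega)) for k in range(n)]

    def bisect(total, target, lo, hi):
        # total is monotone on [lo, hi]; detect its direction once.
        increasing = total(hi) >= total(lo)
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if (total(mid) < target) == increasing:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    psi = bisect(lambda v: sum(x_of(v)), eta, -50.0, 50.0)
    omega = bisect(lambda v: sum(y_of(v)), theta, -50.0, 50.0)
    if not psi < omega:
        raise ValueError("psi >= omega: the formula for psi < omega does not apply")
    return x_of(psi), y_of(omega), psi, omega

kappa = [0.7, 0.2, 0.1]   # hypothetical offsets kappa_k
p = [0.4, 0.4, 0.2]       # hypothetical positive weights p_k
x, y, psi, omega = solve_allocation(kappa, p, eta=0.1, theta=0.1)
```

In this toy instance the quantity $\eta$ is spread over the components with the smallest $h_k(0, 0)$, while $\theta$ is taken from the component with the largest one, which mirrors the threshold structure of cases (ii)(a)-(d).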



Proof: The proof of statement (i) consists of two steps. In the first step, we show that the optimization problem stated in
the lemma is convex; then we apply Karush-Kuhn-Tucker (KKT) conditions to said problem, and finally reformulate these
conditions into a reduced number of equations. The bulk of this proof comes later, in the second step, where we proceed to
solve the system of equations for the two cases considered in the lemma, $\psi < \omega$ and $\psi = \omega$. Lastly, statements (ii) and (iii)
follow from (i).

To see that the problem is convex, simply observe that the objective function is convex on account of $H(f_k) \succeq 0$, and that the
inequality and equality constraint functions are affine. Since the objective and constraint functions are also differentiable and
Slater's constraint qualification holds, KKT conditions are necessary and sufficient conditions for optimality [41]. Systematic
application of these optimality conditions leads to the Lagrangian cost,



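$$\mathcal{L} = \sum_k f_k(x_k, y_k) - \sum_k \lambda_k x_k - \sum_k \mu_k y_k + \sum_k \nu_k (y_k - \kappa_k - x_k) - \psi \Big(\sum_k x_k - \eta\Big) + \omega \Big(\sum_k y_k - \theta\Big),$$

and finally to the conditions

$$\begin{aligned}
& x_k, y_k \geq 0, \quad \kappa_k + x_k - y_k \geq 0, \quad \sum_k x_k = \eta, \quad \sum_k y_k = \theta, && \text{(primal feasibility)} \\
& \lambda_k, \mu_k, \nu_k \geq 0, && \text{(dual feasibility)} \\
& \lambda_k x_k = 0, \quad \mu_k y_k = 0, \quad \nu_k (y_k - \kappa_k - x_k) = 0, && \text{(complementary slackness)} \\
& h_k(x_k, y_k) - \lambda_k - \nu_k - \psi = 0, \quad h_k(x_k, y_k) + \mu_k - \nu_k - \omega = 0. && \text{(dual optimality)}
\end{aligned}$$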

Because $\lim_{x_k - y_k \to -\kappa_k} h_k(x_k, y_k) = -\infty$, it follows from the dual optimality conditions that $\kappa_k + x_k - y_k > 0$, which implies,
by complementary slackness, that $\nu_k = 0$. Subsequently, we may rewrite the dual optimality conditions as $\lambda_k = h_k(x_k, y_k) - \psi$
and $\mu_k = \omega - h_k(x_k, y_k)$. By eliminating the slack variables $\lambda_k, \mu_k$, we obtain the simplified conditions $h_k(x_k, y_k) \geq \psi$ and
$h_k(x_k, y_k) \leq \omega$. Lastly, we substitute the above expressions of $\lambda_k$ and $\mu_k$ into the complementary slackness conditions, so
that we can formulate the dual optimality and complementary slackness conditions equivalently as

$$h_k(x_k, y_k) \geq \psi, \tag{4}$$

$$h_k(x_k, y_k) \leq \omega, \tag{5}$$

$$(h_k(x_k, y_k) - \psi)\, x_k = 0, \tag{6}$$

$$(h_k(x_k, y_k) - \omega)\, y_k = 0. \tag{7}$$



In the following, we shall proceed to solve these equations which, together with the primal and dual feasibility conditions, 
are necessary and sufficient conditions for optimality. To this end, first note that, if ip > uj, then there exists no (x k ,y k ) that 
satisfies equations ^ and §5§ at the same time, and consequently, as stated in part (i) of the lemma, there is no solution. 
Concordantly, next we shall study the case when ip < uj; afterwards we shall tackle the other case when ip = uj. 

Before plunging into the analysis of the former case, recall that the function $h_k$ is strictly increasing in $x_k$ and strictly
decreasing in $y_k$. Having said this, observe that, under the assumption $\psi < \omega$, the variables $x_k$ and $y_k$ cannot be
positive simultaneously by virtue of equations (6) and (7). Bearing this in mind, consider these three possibilities for each $k$:
$h_k(0,0) < \psi$, $\psi \le h_k(0,0) \le \omega$ and $\omega < h_k(0,0)$.

When $h_k(0,0) < \psi$, the only conclusion consistent with (4) and with the fact that $h_k$ is strictly increasing in $x_k$ is that
$x_k > 0$. Since $x_k$ must be positive, the complementary slackness condition (6) implies that $h_k(x_k,y_k) = \psi$ and, because
of (7), that $y_k = 0$. As a result, $x_k$ must satisfy $h_k(x_k,0) = \psi$, or equivalently, $x_k = h_k^{-1}(\psi)$. Next, we show
that the solution $(x_k,0)$ is unique. For this purpose, suppose that $y_k > 0$ and, in consequence, that $x_k = 0$. It follows
from (7), however, that $h_k(0,y_k) = \omega$, which contradicts the fact that $h_k$ is a strictly decreasing function of $y_k$,
since $h_k(0,y_k) < h_k(0,0) < \psi < \omega$. In the end, we verify that $x_k = y_k = 0$ does not satisfy (4) and thus prove that
$(x_k,y_k) = (h_k^{-1}(\psi), 0)$ is the unique minimizer of the objective function when $h_k(0,0) < \psi$.

Now consider the case when $\psi \le h_k(0,0) \le \omega$. First, suppose that $x_k > 0$, and therefore that $y_k = 0$. By complementary
slackness, it follows that $h_k(x_k,0) = \psi$, which is not consistent with the fact that $h_k$ is strictly increasing in $x_k$.
Consequently, $x_k$ cannot be positive. Secondly, assume that $x_k$ is zero and $y_k$ positive. Under this assumption, equation (7)
implies that $h_k(0,y_k) = \omega$, a contradiction since $h_k$ is a strictly decreasing function of $y_k$. Accordingly, $y_k$ cannot
be positive either. Finally, check that $x_k = y_k = 0$ satisfies the optimality conditions and hence is the unique solution.

The last possibility corresponds to the case when $\omega < h_k(0,0)$. Note that, in this case, the only conclusion consistent with (5)
and with the fact that $h_k$ is strictly decreasing in $y_k$ is that $y_k > 0$. Thus, because of (7), $y_k$ must satisfy
$h_k(0,y_k) = \omega$. Recalling from the lemma that $h_k(x_k,y_k) = h_k(x_k - y_k, 0)$, we may express the condition
$h_k(0,y_k) = \omega$ equivalently as $y_k = -h_k^{-1}(\omega)$. Lastly, we check that this solution is unique in the case under
study. To this end, note that a solution such that $x_k > 0$ and $y_k = 0$ contradicts the fact that $h_k$ is strictly increasing
in $x_k$. As a result, $x_k$ cannot be positive. Finally, we confirm that equation (5) does not hold for $x_k = y_k = 0$ and
therefore prove that $(x_k,y_k) = (0, -h_k^{-1}(\omega))$ is the unique solution when $\omega < h_k(0,0)$.

In summary, $x_k = h_k^{-1}(\psi)$ if $h_k(0,0) < \psi$, or equivalently, $h_k^{-1}(\psi) > 0$; otherwise $x_k = 0$. Further,
$y_k = -h_k^{-1}(\omega)$ if $h_k(0,0) > \omega$, or equivalently, $h_k^{-1}(\omega) < 0$; otherwise $y_k = 0$. Accordingly, we may
write the solution compactly as

$$(x_k, y_k) = \left(\max\{0, h_k^{-1}(\psi)\},\ \max\{0, -h_k^{-1}(\omega)\}\right),$$

where $\psi, \omega$ must satisfy the primal equality constraints $\sum_k x_k = \eta$ and $\sum_k y_k = \theta$.
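
The compact solution above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names (`bisect`, `solve_allocation`), the bracketing interval, and the example data are all assumptions. It assumes each inverse $h_k^{-1}$ is continuous and strictly increasing, finds $\psi$ and $\omega$ by bisection so that the two equality constraints hold, and covers only the case $\psi < \omega$. The example uses $h_k(x,0) = \log((\kappa_k + x)/w_k)$, a hypothetical choice that mimics the logarithmic form appearing later in the paper.

```python
import math

def bisect(g, target, lo=-50.0, hi=50.0, iters=200):
    """Solve g(z) = target for a continuous, nondecreasing g by bisection."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def solve_allocation(h_inv, n, eta, theta):
    """Return (x, y, psi, omega) for the case psi < omega.

    h_inv(k, z) is the inverse of h_k(., 0), assumed strictly increasing in z.
    """
    total_x = lambda z: sum(max(0.0, h_inv(k, z)) for k in range(n))
    total_y = lambda z: sum(max(0.0, -h_inv(k, z)) for k in range(n))
    psi = bisect(total_x, eta)
    # total_y is nonincreasing in z, so bisect on its negation.
    omega = bisect(lambda z: -total_y(z), -theta)
    if not psi < omega:
        raise ValueError("psi >= omega: outside the case sketched here")
    x = [max(0.0, h_inv(k, psi)) for k in range(n)]
    y = [max(0.0, -h_inv(k, omega)) for k in range(n)]
    return x, y, psi, omega

# Hypothetical example data (not taken from the paper).
kappa = [0.05, 0.10, 0.20, 0.25, 0.40]
w = [0.2, 0.2, 0.2, 0.2, 0.2]
h_inv = lambda k, z: w[k] * math.exp(z) - kappa[k]  # inverse of log((kappa + x)/w)

x, y, psi, omega = solve_allocation(h_inv, 5, eta=0.10, theta=0.20)
print([round(v, 4) for v in x])
print([round(v, 4) for v in y])
```

Bisection is used here only because it needs nothing beyond monotonicity; when the inverses have a closed form, as in Theorem 3, the multipliers can be obtained directly.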

Having examined the case when $\psi < \omega$, next we proceed to solve the optimality conditions at hand for $\psi = \omega$. Observe
that, in this new case, (4) and (5) transform into the equation

$$h_k(x_k,y_k) = \psi. \qquad (8)$$

Moreover, note that any pair $(x_k,y_k)$ satisfying (8) also meets the complementary slackness conditions (6) and (7). However,
notice that this does not mean that all those pairs are optimal. To elaborate on this point, consider the following three possibilities
for each $k$: $h_k(0,0) < \psi$, $h_k(0,0) = \psi$ and $\psi < h_k(0,0)$.

In the case when $h_k(0,0) < \psi$, the only condition consistent with (8) and with the fact that $h_k$ is strictly increasing in
$x_k$ is that $x_k > 0$. From the lemma, it is immediate that $\frac{\partial h_k}{\partial x_k} = -\frac{\partial h_k}{\partial y_k}$,
which implies that $x_k$ must also be greater than $y_k$. Hence, the set of solutions is

$$\{(x_k,y_k) : h_k(x_k,y_k) = \psi,\ x_k > y_k\},$$

where every pair in this set must also fulfill the primal equality conditions. Let $x'_k$ satisfy $h_k(x'_k,0) = \psi$, or equivalently,
$x'_k = h_k^{-1}(\psi)$. Then, because $h_k(x'_k + \alpha_k, \alpha_k) = \psi$ for any $\alpha_k \ge 0$, this set may be recast equivalently as

$$\{(x_k,y_k) : x_k = x'_k + \alpha_k,\ y_k = \alpha_k\}.$$

For the two remaining cases, i.e., $h_k(0,0) = \psi$ and $\psi < h_k(0,0)$, the set of solutions is obtained in a completely analogous
way. In the former case, the pairs $(x_k,y_k)$ must satisfy $x_k = y_k$, and the set of solutions may be expressed as

$$\{(x_k,y_k) : x_k = \alpha_k,\ y_k = \alpha_k\}.$$

In the latter case, it follows that $y_k > x_k$ and, consequently, that the set of solutions is

$$\{(x_k,y_k) : x_k = \alpha_k,\ y_k = y'_k + \alpha_k\},$$

where $y'_k$ must satisfy $h_k(0,y'_k) = \psi$.

To sum up, the case $\psi = \omega$ leads to the following solutions: $x_k = h_k^{-1}(\psi) + \alpha_k$ if $h_k(0,0) < \psi$, or
equivalently, $h_k^{-1}(\psi) > 0$; otherwise $x_k = \alpha_k$. In addition, $y_k = -h_k^{-1}(\omega) + \alpha_k$ if $h_k(0,0) > \omega$,
or equivalently, $h_k^{-1}(\omega) < 0$; otherwise $y_k = \alpha_k$. Accordingly, the solutions $(x_k,y_k)$ yield

$$\left(\max\{0, h_k^{-1}(\psi)\} + \alpha_k,\ \max\{0, -h_k^{-1}(\omega)\} + \alpha_k\right), \qquad (9)$$

for some $\psi, \omega$ and nonnegative sequence $\alpha_1, \ldots, \alpha_n$ such that $\sum_k x_k = \eta$ and $\sum_k y_k = \theta$.
Note that, although $\psi = \omega$, we intentionally write $\omega$ instead of $\psi$ to highlight that the solutions for $\psi < \omega$
and for $\psi = \omega$ just differ in the term $\alpha_k$, as we claimed in part (i) of the lemma.

To complete the proof of statement (i), it suffices to show that the number of solutions is infinite when $\psi = \omega$. To this end,
simply observe that there exists an infinite number of sequences $\alpha_1, \ldots, \alpha_n$ such that

$$\sum_k x_k = \sum_k \max\{0, h_k^{-1}(\psi)\} + \sum_k \alpha_k = \eta \quad \text{and}$$
$$\sum_k y_k = \sum_k \max\{0, -h_k^{-1}(\omega)\} + \sum_k \alpha_k = \theta,$$

which results in an infinite number of solutions of the form given in (9).

Now we proceed to prove (ii), which is an immediate consequence of (i). For this purpose, observe that if
$\psi \le h_{i+1}(0,0) \le \cdots \le h_n(0,0)$ holds for some $i = 0, \ldots, n-1$, then
$h_{i+1}^{-1}(\psi), \ldots, h_n^{-1}(\psi) \le 0$, and accordingly $x_{i+1} = \cdots = x_n = 0$.
Similarly, if $h_1(0,0) \le \cdots \le h_{j-1}(0,0) \le \omega$ is satisfied for some $j = 2, \ldots, n+1$, then
$h_1^{-1}(\omega), \ldots, h_{j-1}^{-1}(\omega) \ge 0$, and thus $y_1 = \cdots = y_{j-1} = 0$.

Note that the particular case when the index $i$ ranges from 1 to $j-1$ and the index $j$ goes from 2 to $n$ is the case described
in (ii) (a), which corresponds to $\eta, \theta > 0$. Further, observe that the case assumed in (ii) (b), i.e., when $j = n+1$, implies
that $\theta = 0$. Here, the index $i$ starts at 1, therefore excluding $\eta = 0$, and ends at $n$, including the possibility that
$x_i > 0$ for all $i$. In part (ii) (c), we consider $i = 0$, which is equivalent to the condition $\eta = 0$. In this case, the index
$j$ starts at 1, permitting $y_j > 0$ for all $j$, and ends at $n$, avoiding $\theta = 0$. Finally, the case described in (ii) (d),
namely when $j = n+1$ and $i = 0$, is precisely the trivial case $x = y = 0$.

In order to verify statement (iii), we proceed analogously by noting that if $\psi = h_{i+1}(0,0) = \cdots = h_{j-1}(0,0)$ holds for
some $i = 1, \ldots, j-2$ and some $j = 3, \ldots, n$, then $h_{i+1}^{-1}(\psi) = \cdots = h_{j-1}^{-1}(\psi) = 0$, and consequently
$x_k = y_k = \alpha_k$ for $k = i+1, \ldots, j-1$. ■

The previous lemma presented the solution to a resource allocation problem that minimizes a rather general but convex
objective function, subject to affine constraints. Our next theorem, Theorem 3, applies the results of this lemma to the special
case of the objective function of problem (1). In doing so, we shall confirm the intuition that there must exist a set of ordered
pairs $(\rho,\sigma)$ where the privacy risk vanishes and another set where it does not. We shall refer to the former set as the
critical-privacy region and formally define it as

$$\mathcal{C} = \{(\rho,\sigma) : \mathcal{R}(\rho,\sigma) = 0\}.$$

The latter set will be the complementary set $\bar{\mathcal{C}}$ and we shall refer to it as the noncritical-privacy region.

Before proceeding with Theorem 3, we shall first introduce what we term forgery and suppression thresholds, two sequences
of rates that will play a fundamental role in the characterization of the solution to the minimization problem defining the
privacy-forgery-suppression function. Secondly, we shall investigate certain properties of these thresholds in Proposition 2.
Thereafter, we shall introduce some definitions that will facilitate the exposition of the aforementioned theorem.

Let $Q_i = \sum_{k=1}^{i} q_k$ and $P_i = \sum_{k=1}^{i} p_k$ be the cumulative distribution functions corresponding to $q$ and $p$.
Denote by $\bar{Q}_i = \sum_{k=i}^{n} q_k$ and $\bar{P}_i = \sum_{k=i}^{n} p_k$ the complementary cumulative distribution functions of
$q$ and $p$. Define the forgery thresholds $\rho_i$ as

$$\rho_i = \begin{cases} P_i \frac{q_i}{p_i} - Q_i, & i = 1, \ldots, j-1, \\ \frac{P_{j-1}}{\bar{P}_j}(\bar{Q}_j - \sigma) - Q_{j-1}, & i = j, \\ \infty, & i = j+1, \end{cases}$$

for $j = 2, \ldots, n$. Additionally, define the suppression thresholds $\sigma_j$ as

$$\sigma_j = \bar{Q}_j - \bar{P}_j \frac{q_j}{p_j}$$

for $j = 1, \ldots, n$, and $\sigma_0 = 1$. Observe that $\rho_1 = \sigma_n = 0$ and that the forgery threshold $\rho_j$ is a linear
function of $\sigma$. We shall refer to this latter threshold as the critical forgery-suppression threshold and denote it also by
$\rho_{\mathrm{crit}}(\sigma)$. The reason is that said threshold will determine the boundary of the critical-privacy region, as we
shall see later. The following result, Proposition 2, characterizes the monotonicity of the forgery and the suppression thresholds.
Proposition 2 (Monotonicity of Thresholds):

(i) For $j = 3, \ldots, n$ and $i = 1, \ldots, j-2$, the forgery thresholds satisfy $\rho_i \le \rho_{i+1}$, with equality if, and
only if, $\frac{q_i}{p_i} = \frac{q_{i+1}}{p_{i+1}}$.

(ii) For $j = 2, \ldots, n$, the suppression thresholds satisfy $\sigma_j \le \sigma_{j-1}$, with equality if, and only if,
$\frac{q_j}{p_j} = \frac{q_{j-1}}{p_{j-1}}$.

(iii) Further, for any $j = 2, \ldots, n$ and any $\sigma \in (\sigma_j, \sigma_{j-1}]$, the critical forgery-suppression threshold
satisfies $\rho_j(\sigma) \ge \rho_{j-1}$, with equality if, and only if, $\sigma = \sigma_{j-1}$.

Proof: The first statement can be shown from the definition of the forgery thresholds by routine algebraic manipulation and
under the labeling assumption (3). To this end, it is helpful to note that

$$P_i \frac{q_{i+1}}{p_{i+1}} - Q_i = P_{i+1} \frac{q_{i+1}}{p_{i+1}} - Q_{i+1}.$$

The second statement can be shown analogously, observing that

$$\bar{Q}_j - \bar{P}_j \frac{q_{j-1}}{p_{j-1}} = \bar{Q}_{j-1} - \bar{P}_{j-1} \frac{q_{j-1}}{p_{j-1}}.$$

For the last statement, use the definitions of the forgery and the suppression thresholds to note that the condition
$\rho_j(\sigma) \ge \rho_{j-1}$ is equivalent to $\sigma \le \sigma_{j-1}$. ■
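
The thresholds and the monotonicity claims of Proposition 2 are easy to check numerically. The sketch below is illustrative only: it assumes NumPy, the helper names `thresholds` and `rho_crit` are hypothetical, and the five-category distributions are made up for the example (they are not the ones used in the paper's figures). The categories are assumed to be already labeled so that $q_k/p_k$ is nondecreasing.

```python
import numpy as np

def thresholds(q, p):
    """Forgery thresholds rho_i and suppression thresholds sigma_j (i, j = 1..n)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    Q, P = np.cumsum(q), np.cumsum(p)                             # Q_i, P_i
    Qb, Pb = np.cumsum(q[::-1])[::-1], np.cumsum(p[::-1])[::-1]   # complementary sums
    rho = P * q / p - Q
    sigma = Qb - Pb * q / p
    return rho, sigma

def rho_crit(q, p, sigma, j):
    """Critical threshold rho_j(sigma) for a 1-based index j in 2..n."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    Q, P = np.cumsum(q), np.cumsum(p)
    Qb, Pb = np.cumsum(q[::-1])[::-1], np.cumsum(p[::-1])[::-1]
    return P[j - 2] / Pb[j - 1] * (Qb[j - 1] - sigma) - Q[j - 2]

# Hypothetical example: uniform population profile, skewed user profile.
p = [0.2, 0.2, 0.2, 0.2, 0.2]
q = [0.05, 0.10, 0.20, 0.25, 0.40]   # q_k / p_k is increasing in k

rho, sigma = thresholds(q, p)
print(np.round(rho, 4))     # nondecreasing in i, starting at rho_1 = 0
print(np.round(sigma, 4))   # nonincreasing in j, ending at sigma_n = 0
print(round(rho_crit(q, p, 0.20, 4), 4))
```

Evaluating `rho_crit` at $\sigma = \sigma_{j-1}$ returns $\rho_{j-1}$, which is the equality case of statement (iii).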
Before investigating a closed-form solution to problem (1), we introduce some definitions for ease of presentation. For
$i = 1, \ldots, j-1$ and $j = 2, \ldots, n$, define

$$\tilde{q} = (Q_i, q_{i+1}, \ldots, q_{j-1}, \bar{Q}_j),$$
$$\tilde{r} = (\rho, 0, \ldots, 0),$$
$$\tilde{s} = (0, \ldots, 0, \sigma),$$
$$\tilde{p} = (P_i, p_{i+1}, \ldots, p_{j-1}, \bar{P}_j),$$

where $\tilde{q}$ and $\tilde{p}$ are distributions in the probability simplex of $j - i + 1$ dimensions, and $\tilde{r}$ and $\tilde{s}$
are tuples of the same dimension that represent a forgery strategy and a suppression strategy, respectively. Particularly, note that
the indexes $i = 1$ and $j = n$ lead to $\tilde{q} = q$ and $\tilde{p} = p$.

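Theorem 3: Let $\partial\mathcal{C}$ be the boundary of $\mathcal{C}$, and $\mathrm{cl}\,\bar{\mathcal{C}}$ the closure of its
complement $\bar{\mathcal{C}}$.

(i) $\partial\mathcal{C} \subset \mathcal{C}$ and

$$\partial\mathcal{C} = \{(\rho,\sigma) : \rho = \rho_j(\sigma),\ \sigma \in [\sigma_j, \sigma_{j-1}],\ \text{for } j = 2, \ldots, n\}.$$

(ii) For any $(\rho,\sigma) \in \mathrm{cl}\,\bar{\mathcal{C}}$, either $\rho \in [\rho_i, \rho_{i+1}]$ for $i = 1$ or
$\rho \in (\rho_i, \rho_{i+1}]$ for some $i = 2, \ldots, j-1$, and either $\sigma \in [\sigma_j, \sigma_{j-1}]$ for $j = n$ or
$\sigma \in (\sigma_j, \sigma_{j-1}]$ for some $j = 2, \ldots, n-1$. Then, for the corresponding indexes $i, j$, the optimal forgery
and suppression strategies are

$$r_k^* = \begin{cases} \frac{p_k}{P_i}(Q_i + \rho) - q_k, & k = 1, \ldots, i, \\ 0, & k = i+1, \ldots, n, \end{cases}$$
$$s_k^* = \begin{cases} 0, & k = 1, \ldots, j-1, \\ q_k - \frac{p_k}{\bar{P}_j}(\bar{Q}_j - \sigma), & k = j, \ldots, n, \end{cases}$$

and the corresponding, minimum KL divergence yields the privacy-forgery-suppression function

$$\mathcal{R}(\rho,\sigma) = D\!\left(\frac{\tilde{q} + \tilde{r} - \tilde{s}}{1 + \rho - \sigma} \,\Big\|\, \tilde{p}\right).$$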
Proof: The proof is structured as follows. We begin by showing that the optimization problem (1) may be construed as a
particular case of that stated in Lemma 1. Accordingly, we apply this lemma, namely the cases (ii) and (iii), to obtain the
optimal forgery and suppression strategies. The application of the former case allows us to derive the solution for
$(\rho,\sigma) \in \bar{\mathcal{C}}$. The latter case enables us, first, to confirm that this solution is also valid on
$\partial\mathcal{C}$, and secondly, to prove statement (i). Lastly, we complete the proof of (ii) by expressing function (1) in
terms of the optimal apparent distribution.

Use the definition of KL divergence to write the objective function of the optimization problem as
$D(t \,\|\, p) = \sum_k t_k \log \frac{t_k}{p_k}$ with $t = \frac{q + r - s}{1 + \rho - \sigma}$. Observe that the functions
$f_k(r_k, s_k) = t_k \log \frac{t_k}{p_k}$ are twice differentiable on $\{(r_k, s_k) : q_k + r_k - s_k > 0\}$.

Denote by $h_k$ the derivative of $f_k$ with respect to $r_k$,

$$h_k(r_k, s_k) = \frac{1}{1 + \rho - \sigma}\left(\log \frac{q_k + r_k - s_k}{(1 + \rho - \sigma)\, p_k} + 1\right). \qquad (10)$$

Then, note that the functions $f_k$ and $h_k$ satisfy the assumptions of Lemma 1 and that the inequality and equality constraints
of function (1) coincide with those in the lemma. This exposes the structure of the optimization problem as a special case of
the resource allocation lemma.

Before proceeding any further, notice from (10) that $h_k(r_k, 0)$ is a strictly increasing function of $r_k$ and hence invertible.
Note also that, according to the lemma, the solutions are completely determined by the inverse of this function, which is
denoted by $h_k^{-1}$ and yields

$$h_k^{-1}(\psi) = p_k (1 + \rho - \sigma)\, 2^{(1 + \rho - \sigma)\psi - 1} - q_k.$$

Finally, observe that the assumption $h_1(0,0) \le \cdots \le h_n(0,0)$ in the lemma is equivalent to the labeling assumption (3), as
$h_k(0,0)$ is a strictly increasing function of $\frac{q_k}{p_k}$.
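Next we apply Lemma 1 (ii), where the condition $\psi < \omega$ is assumed. We start with case (ii) (a). On account of part (i)
of the lemma, the optimal forgery strategy must satisfy

$$\rho = \sum_{k=1}^{i} h_k^{-1}(\psi) = P_i (1 + \rho - \sigma)\, 2^{(1 + \rho - \sigma)\psi - 1} - Q_i,$$

or equivalently,

$$\psi = \frac{1}{1 + \rho - \sigma}\left(\log \frac{Q_i + \rho}{(1 + \rho - \sigma) P_i} + 1\right).$$

Analogously for the suppression strategy,

$$\sigma = -\sum_{k=j}^{n} h_k^{-1}(\omega) = \bar{Q}_j - \bar{P}_j (1 + \rho - \sigma)\, 2^{(1 + \rho - \sigma)\omega - 1},$$

and therefore

$$\omega = \frac{1}{1 + \rho - \sigma}\left(\log \frac{\bar{Q}_j - \sigma}{(1 + \rho - \sigma) \bar{P}_j} + 1\right).$$

Then it suffices to substitute the expressions of $\psi$ and $\omega$ into the function $h_k^{-1}$, to obtain the nonzero optimal
solutions claimed in assertion (ii) of the theorem.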

Now we proceed to confirm the interval of values of $\rho$ and $\sigma$ where these solutions are defined. In the case under study,
$\psi$ and $\omega$ satisfy $h_i(0,0) < \psi \le h_{i+1}(0,0)$ for some $i = 1, \ldots, j-1$ and
$h_{j-1}(0,0) \le \omega < h_j(0,0)$ for some $j = 2, \ldots, n$. We split the discussion into two cases, namely $i < j-1$ and
$i = j-1$.

Assume the former case. Observe that the condition $h_i(0,0) < \psi$ is equivalent to

$$\frac{1}{1 + \rho - \sigma}\left(\log \frac{q_i}{(1 + \rho - \sigma) p_i} + 1\right) < \frac{1}{1 + \rho - \sigma}\left(\log \frac{Q_i + \rho}{(1 + \rho - \sigma) P_i} + 1\right),$$

and finally, after routine algebraic manipulation, to

$$\rho > P_i \frac{q_i}{p_i} - Q_i.$$

Similarly, the upper-bound condition $\psi \le h_{i+1}(0,0)$ leads to

$$\rho \le P_i \frac{q_{i+1}}{p_{i+1}} - Q_i.$$

Hence, the intervals resulting from imposing $h_i(0,0) < \psi \le h_{i+1}(0,0)$ are of the form $(\rho_i, \rho_{i+1}]$. The
monotonicity of the thresholds $\rho_i$, demonstrated in Proposition 2, guarantees that these intervals are contiguous and
nonoverlapping. In an analogous manner, it can be shown that the condition $h_{j-1}(0,0) \le \omega < h_j(0,0)$ leads to intervals of
the form $(\sigma_j, \sigma_{j-1}]$, also contiguous and nonoverlapping by virtue of Proposition 2.

Now assume the latter case, where $h_i(0,0) < \psi < \omega < h_j(0,0)$ with $i = j-1$. On the one hand, the assumption
$h_{j-1}(0,0) < \psi$ is, as shown above, equivalent to the condition $\rho > \rho_{j-1}$. On the other hand, straightforward
manipulation allows us to write the inequality $\psi < \omega$ as

$$\rho < \frac{P_{j-1}}{\bar{P}_j}(\bar{Q}_j - \sigma) - Q_{j-1}.$$

Combining these two bounds on $\rho$, we obtain the interval $(\rho_{j-1}, \rho_{\mathrm{crit}}(\sigma))$. With this last interval,
we complete the range of validity of the solution for the case (ii) (a) in the lemma. Ultimately, it is easy to verify that, in those
intervals of $\rho$ and $\sigma$, the optimal apparent profile $t = \frac{q + r - s}{1 + \rho - \sigma}$ does not coincide with the
population's profile $p$. In consequence, $D(t \,\|\, p) > 0$.

Next, we turn to case (ii) (b) of the lemma. Here, the assumption $h_n(0,0) \le \omega$ leads to $\sigma = 0$, or equivalently, to the
solution $s = 0$. Note that, precisely, this is the solution given in the theorem for $\sigma = \sigma_j$ with $j = n$. On the other
hand, the application of the condition $\sum_{k=1}^{i} r_k = \rho$ results in the same optimal forgery strategy obtained in case
(ii) (a). Proceeding analogously as in this case, from the assumptions on $\psi$ we derive the intervals of values of $\rho$ where the
solution is defined: $(\rho_i, \rho_{i+1}]$ for $i = 1, \ldots, n-1$ and $(\rho_i, \rho_{i+1})$ for $i = n$. Given these intervals,
it is then straightforward to check that $\mathcal{R}(\rho, 0) = 0$ if, and only if, $\rho \ge \rho_n$. This provides us with the
pairs $(\rho, 0)$ that belong to $\mathrm{cl}\,\bar{\mathcal{C}}$.

In case (ii) (c), the condition $\psi \le h_1(0,0)$ means that $\rho = 0$, or equivalently, $r = 0$. Observe that this is the solution
stated in the theorem for $\rho = \rho_i$ with $i = 1$. Then again, the condition $\sum_{k=j}^{n} s_k = \sigma$ leads to the same
optimal suppression strategy found in case (ii) (a). From the assumptions in the lemma on $\omega$, we obtain the intervals
$(\sigma_j, \sigma_{j-1}]$ for $j = 2, \ldots, n$ and $(\sigma_j, \sigma_{j-1})$ for $j = 1$. Then, we verify that
$\mathcal{R}(0, \sigma) = 0$ if, and only if, $\sigma \ge \sigma_1$, from which the pairs $(0, \sigma)$ that belong to
$\mathrm{cl}\,\bar{\mathcal{C}}$ follow.

Finally, the case (ii) (d) in the lemma, in which $h_n(0,0) \le \omega$ and $\psi \le h_1(0,0)$, corresponds to the trivial case
$\sigma = \sigma_j$ for $j = n$ and $\rho = \rho_i$ for $i = 1$, that is, the solution $r = s = 0$.

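After having applied Lemma 1 (ii) to function (1), now we proceed with case (iii) (a). In applying it, we shall show that
the solution claimed in the theorem is also valid for the extreme values of the intervals in case (ii) (a), specifically the set

$$\{(\rho,\sigma) : \rho = \rho_{\mathrm{crit}}(\sigma),\ \sigma \in (\sigma_j, \sigma_{j-1}] \text{ for } j = 3, \ldots, n, \text{ and } \sigma \in (\sigma_j, \sigma_{j-1}) \text{ for } j = 2\}.$$

Assume the case (iii) (a) in which $h_i(0,0) < \psi = \omega < h_j(0,0)$ for some $j = 2, \ldots, n$ and $i = j-1$. Under this
assumption, the equality constraint $\sum_{k=1}^{n} r_k = \rho$ in the lemma is equivalent, after simple algebraic manipulation, to

$$\psi = \frac{1}{1 + \rho - \sigma}\left(\log \frac{Q_{j-1} + \rho - \zeta}{(1 + \rho - \sigma) P_{j-1}} + 1\right),$$

where we define $\zeta = \sum_{k=1}^{n} \alpha_k$. Similarly, the equality constraint $\sum_{k=1}^{n} s_k = \sigma$ becomes

$$\omega = \frac{1}{1 + \rho - \sigma}\left(\log \frac{\bar{Q}_j - \sigma + \zeta}{(1 + \rho - \sigma) \bar{P}_j} + 1\right).$$

But $\psi = \omega$, therefore

$$\frac{Q_{j-1} + \rho - \zeta}{P_{j-1}} = \frac{\bar{Q}_j - \sigma + \zeta}{\bar{P}_j},$$

or equivalently,

$$\rho = \rho_{\mathrm{crit}}(\sigma) + \frac{\zeta}{\bar{P}_j}.$$

In short, the assumption $\psi = \omega$ imposes the condition $(\rho,\sigma) \succeq (\rho_{\mathrm{crit}}(\sigma), \sigma)$ for some
nonnegative sequence $\alpha_1, \ldots, \alpha_n$ satisfying the above equality. Next we examine, for a given $\sigma$, these two
possibilities, $\rho = \rho_{\mathrm{crit}}(\sigma)$ and $\rho > \rho_{\mathrm{crit}}(\sigma)$.

Consider the former possibility and observe that $\rho = \rho_{\mathrm{crit}}(\sigma)$ if, and only if, $\alpha_k = 0$ for
$k = 1, \ldots, n$. According to the lemma, the nonzero optimal solutions yield

$$r_k = h_k^{-1}(\psi) = p_k \frac{Q_{j-1} + \rho_{\mathrm{crit}}(\sigma)}{P_{j-1}} - q_k = p_k (1 + \rho_{\mathrm{crit}}(\sigma) - \sigma) - q_k$$

for $k = 1, \ldots, j-1$, and

$$s_k = -h_k^{-1}(\omega) = q_k - p_k (1 + \rho_{\mathrm{crit}}(\sigma) - \sigma)$$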

for $k = j, \ldots, n$, that is, the solutions obtained after applying case (ii) (a), but evaluated at
$\rho = \rho_{\mathrm{crit}}(\sigma)$. From these expressions for $r$ and $s$, it is then immediate to verify that $t = p$ and thus
$\mathcal{R}(\rho,\sigma) = 0$.

[Fig. 2 legend: population's profile p; actual user profile q; optimal forgery strategy r*; optimal suppression strategy s*; optimal apparent profile t*.]

Fig. 2: A user's item distribution is perturbed according to two optimal forgery and suppression strategies, in order for the resulting profile to minimize the
KL divergence with respect to the population's distribution.

Now we assume the latter possibility, i.e., $(\rho,\sigma) \succ (\rho_{\mathrm{crit}}(\sigma), \sigma)$, to show that the privacy-risk
function also vanishes for these values of $\rho$ and $\sigma$. On account of part (iii) (a) of the lemma and (9), we derive the
optimal forgery and suppression strategies

$$r_k = p_k (1 + \rho_{\mathrm{crit}}(\sigma) - \sigma) + \frac{p_k \zeta}{\bar{P}_j} - q_k + \alpha_k$$

and $s_k = \alpha_k$ for $k = 1, \ldots, j-1$, and

$$s_k = q_k - p_k (1 + \rho_{\mathrm{crit}}(\sigma) - \sigma) - \frac{p_k \zeta}{\bar{P}_j} + \alpha_k$$

and $r_k = \alpha_k$ for $k = j, \ldots, n$. Then, we substitute $r$ and $s$ back into the apparent profile $t$ and check that
$D(t \,\|\, p) = 0$. In doing so, we determine the pairs $(\rho,\sigma) \in \mathcal{C}$ that belong to
$\mathrm{cl}\,\bar{\mathcal{C}}$, and finally obtain the expression for the boundary of the critical-privacy region claimed in
statement (i) of the theorem.

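To conclude the proof, it remains only to write the privacy-risk function
$\mathcal{R}(\rho,\sigma) = \sum_{k=1}^{n} t_k \log \frac{t_k}{p_k}$ in terms of the optimal apparent distribution. With this aim, we
split the summation into three parts. The first part, corresponding to $t_k = \frac{p_k (Q_i + \rho)}{P_i (1 + \rho - \sigma)}$, is

$$\sum_{k=1}^{i} t_k \log \frac{t_k}{p_k} = \frac{Q_i + \rho}{1 + \rho - \sigma} \log \frac{Q_i + \rho}{(1 + \rho - \sigma) P_i},$$

where we use the fact that $\frac{t_k}{p_k}$ does not depend on $k$. The second part of the sum, corresponding to
$t_k = \frac{q_k}{1 + \rho - \sigma}$, yields

$$\sum_{k=i+1}^{j-1} t_k \log \frac{t_k}{p_k} = \sum_{k=i+1}^{j-1} \frac{q_k}{1 + \rho - \sigma} \log \frac{q_k}{(1 + \rho - \sigma) p_k}.$$

The last part, corresponding to $t_k = \frac{p_k (\bar{Q}_j - \sigma)}{\bar{P}_j (1 + \rho - \sigma)}$, is

$$\sum_{k=j}^{n} t_k \log \frac{t_k}{p_k} = \frac{\bar{Q}_j - \sigma}{1 + \rho - \sigma} \log \frac{\bar{Q}_j - \sigma}{(1 + \rho - \sigma) \bar{P}_j},$$

where we note that $\frac{t_k}{p_k}$ does not depend on $k$ either. Now, it is straightforward to identify the terms of
$\mathcal{R}(\rho,\sigma)$ as the KL divergence between the distributions

$$\left(\frac{Q_i + \rho}{1 + \rho - \sigma}, \frac{q_{i+1}}{1 + \rho - \sigma}, \ldots, \frac{q_{j-1}}{1 + \rho - \sigma}, \frac{\bar{Q}_j - \sigma}{1 + \rho - \sigma}\right)$$

and

$$(P_i, p_{i+1}, \ldots, p_{j-1}, \bar{P}_j),$$

precisely the distributions stated in the theorem. ■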
In light of Theorem 3, we remark on the intuitive principle that both the optimal forgery and suppression strategies
follow. On the one hand, the forgery strategy suggests adding ratings to those categories with a low ratio $\frac{q_k}{p_k}$, that is,
to those in which the user's interest is considerably lower than the population's. On the other hand, the suppression strategy
recommends eliminating ratings from those categories where the ratio $\frac{q_k}{p_k}$ is high, i.e., where the interest of the user
exceeds that of the population.

Another straightforward consequence of Theorem 3 is the role of the forgery and the suppression thresholds. In particular,
we identify $\rho_i$ as the forgery rate beyond which the components of $r_k$ for $k = 1, \ldots, i$ become positive. A similar
reasoning applies to $\sigma_j$, which indicates the suppression rate beyond which the components of $s_k$ for $k = j, \ldots, n$ are
positive. In a nutshell, these thresholds determine the number of nonzero components of the optimal strategies.

Also, from this theorem we deduce that the perturbation of the user profile does not only affect those categories where
either $r_k > 0$ or $s_k > 0$. In fact, since we are dealing with relative frequencies, the components of the apparent distribution
$t_k$ belonging to the categories $k = i+1, \ldots, j-1$ are normalized by $\frac{1}{1 + \rho - \sigma}$. Fig. 2 illustrates these
three conclusions by means of a simple example with $n = 5$ categories of interest.

In this example we consider a user who is disposed to submit a percentage of false ratings $\rho \in (\rho_2, \rho_3]$, and to refrain
from sending a fraction of genuine ratings $\sigma \in (\sigma_4, \sigma_3]$. Given these rates, the optimal forgery strategy recommends
that the user forge ratings belonging to the categories 1 and 2, where clearly there is a lack of interest, compared to the reference
distribution. On the contrary, the suppression strategy specifies that the user eliminate ratings from the categories 4 and 5, that
is, from those categories where they show too much interest, again compared to the population's profile. In adopting these two
strategies, the apparent user profile approaches the population's distribution, especially in those components where the ratio
$\frac{q_k}{p_k}$ deviates significantly from 1. Finally, the component of the apparent profile $t_3$, which is not directly affected
by the forgery and the suppression strategies, gets closer to $p_3$ as a result of the aforementioned normalization.
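
A situation of this kind can be reproduced with a short script. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes NumPy, the function name `optimal_strategies` is hypothetical, and the distributions and rates are made-up values chosen so that $\rho \in (\rho_2, \rho_3]$ and $\sigma \in (\sigma_4, \sigma_3]$. It evaluates the closed-form strategies of Theorem 3, assuming the categories are labeled so that $q_k/p_k$ is nondecreasing and that $(\rho,\sigma)$ does not exceed the critical threshold.

```python
import numpy as np

def optimal_strategies(q, p, rho, sigma):
    """Closed-form r*, s* of Theorem 3 (sketch; q_k/p_k assumed nondecreasing)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    Q, P = np.cumsum(q), np.cumsum(p)                             # Q_i, P_i
    Qb, Pb = np.cumsum(q[::-1])[::-1], np.cumsum(p[::-1])[::-1]   # complementary sums
    rho_thr = P * q / p - Q            # forgery thresholds rho_i
    sig_thr = Qb - Pb * q / p          # suppression thresholds sigma_j
    i = max(1, int(np.sum(rho_thr < rho)))                   # 1-based index i
    j = len(q) + 1 - max(1, int(np.sum(sig_thr < sigma)))    # 1-based index j
    c = (Q[i - 1] + rho) / P[i - 1]        # common value of (q_k + r_k)/p_k, k <= i
    d = (Qb[j - 1] - sigma) / Pb[j - 1]    # common value of (q_k - s_k)/p_k, k >= j
    if c > d + 1e-12:
        raise ValueError("(rho, sigma) lies beyond the critical threshold")
    r = np.maximum(0.0, c * p - q)
    s = np.maximum(0.0, q - d * p)
    return r, s

# Hypothetical example with n = 5 categories (not the data behind Fig. 2).
p = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
q = np.array([0.05, 0.10, 0.20, 0.25, 0.40])
rho, sigma = 0.10, 0.20

r, s = optimal_strategies(q, p, rho, sigma)
t = (q + r - s) / (1 + rho - sigma)    # apparent profile
print(np.round(r, 4))   # forgery only in categories 1 and 2
print(np.round(s, 4))   # suppression only in categories 4 and 5
print(np.round(t, 4))
```

In this example the third component changes only through the normalization by $1 + \rho - \sigma$, as described above.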

In the following subsections, we shall analyze a number of important consequences of Theorem 3.



B. Orthogonality, Continuity and Proportionality 

In this subsection we study some interesting properties of the closed-form solution obtained in Sec. IV-A. Specifically, we
investigate the orthogonality and continuity of the optimal forgery and suppression strategies, and then establish a proportionality 
relationship between the optimal apparent user profile and the population's distribution. 

Corollary 4 (Orthogonality and Continuity):

(i) For any $(\rho,\sigma) \in \mathrm{cl}\,\bar{\mathcal{C}}$, the optimal forgery and suppression strategies satisfy
$r_k^* s_k^* = 0$ for $k = 1, \ldots, n$.

(ii) The components of $r^*$ and $s^*$, interpreted as functions of $\rho$ and $\sigma$ respectively, are continuous on
$\mathrm{cl}\,\bar{\mathcal{C}}$.

Proof: The proof of (i) is trivial from Theorem 3. To prove statement (ii) we also resort to this theorem. According to it,
each component $r_k^*$ may be regarded as a piecewise function of $\rho$ defined on the contiguous, nonoverlapping intervals
$[\rho_i, \rho_{i+1}]$ for $i = 1$ and $(\rho_i, \rho_{i+1}]$ for $i = 2, \ldots, j-1$. A direct verification shows that, for any
$k = j, \ldots, n$, the component $r_k^*$ is identically zero on the whole interval $[\rho_1, \rho_j]$ and hence continuous. For any
$k = 1, \ldots, j-1$, we immediately check the continuity of $r_k^*$ on the interior of each of the intervals parameterized by $i$.
Now we examine the endpoints of such intervals. The continuity at the extreme points $\rho_1$ and $\rho_j$ is verified
straightforwardly as the intervals are closed at these points. Then, we check that the limit at the remaining endpoints $\rho_i$
exists, since

$$\lim_{\rho \to \rho_i^-} r_k^*(\rho) = \frac{p_k}{P_{i-1}}(Q_{i-1} + \rho_i) - q_k = \frac{p_k}{P_i}(Q_i + \rho_i) - q_k = \lim_{\rho \to \rho_i^+} r_k^*(\rho),$$

for $i = 2, \ldots, j-1$. Because each limit coincides with the corresponding value $r_k^*(\rho_i)$, we prove the continuity of the
components $r_1^*, \ldots, r_{j-1}^*$. The proof of the continuity of the components of $s^*$ is analogous to that of $r^*$. ■

The orthogonality of the optimal forgery and suppression strategies, in the sense indicated by Corollary 4 (i), conforms to
intuition: it would not make any sense to submit false ratings to items of a particular category and, at the same time, eliminate
genuine ratings from this category. This intuitive result is illustrated in Fig. 2. The second part of Corollary 4 is applied to
show our next result, Proposition 5.

Proposition 5 (Proportionality): Define the piecewise functions
$\phi(\rho,\sigma) = \frac{Q_i + \rho}{(1 + \rho - \sigma) P_i}$ and $\chi(\rho,\sigma) = \frac{\bar{Q}_j - \sigma}{(1 + \rho - \sigma) \bar{P}_j}$
on the intervals $[\sigma_j, \sigma_{j-1}]$ for $j = 2, \ldots, n$ and $[\rho_i, \rho_{i+1}]$ for $i = 1, \ldots, j-1$.

(i) For any $j = 2, \ldots, n$ and $i = 1, \ldots, j-1$, and for any $\sigma \in [\sigma_j, \sigma_{j-1}]$ and
$\rho \in [\rho_i, \rho_{i+1}]$, the optimal apparent profile $t^*$ and the population's distribution $p$ satisfy

$$\phi(\rho,\sigma) = \frac{t_1^*}{p_1} = \cdots = \frac{t_i^*}{p_i}, \qquad \chi(\rho,\sigma) = \frac{t_j^*}{p_j} = \cdots = \frac{t_n^*}{p_n},$$

and

$$\phi(\rho,\sigma) \le \frac{t_{i+1}^*}{p_{i+1}} \le \cdots \le \frac{t_{j-1}^*}{p_{j-1}} \le \chi(\rho,\sigma).$$

(ii) The function $\phi$ is continuous and strictly increasing in each of its arguments, and satisfies $\phi(\rho,\sigma) \le 1$,
with equality if, and only if, $(\rho,\sigma) = (\rho_j(\sigma), \sigma)$.

(iii) The function $\chi$ is continuous and strictly decreasing in each of its arguments, and satisfies $\chi(\rho,\sigma) \ge 1$,
with equality if, and only if, $(\rho,\sigma) = (\rho_j(\sigma), \sigma)$.

Fig. 3: Proportionality relationship between the optimal user's apparent item distribution and the population's profile. In this figure we show the ratios $\frac{t_k^*}{p_k}$ of
the example illustrated in Fig. 2, where the number of categories is $n = 5$, $\rho \in [\rho_2, \rho_3]$ and $\sigma \in [\sigma_4, \sigma_3]$.

Proof: The continuity of the components of $t^*$ on $\mathrm{cl}\,\bar{\mathcal{C}}$ follows from Corollary 4 (ii). This allows us to
write the intervals in Theorem 3 as $[\rho_i, \rho_{i+1}]$ and $[\sigma_j, \sigma_{j-1}]$ in lieu of $(\rho_i, \rho_{i+1}]$ and
$(\sigma_j, \sigma_{j-1}]$, respectively. From the expressions of $r_k^*$ and $s_k^*$ in the theorem, it is immediate to identify the
ratios $\frac{t_k^*}{p_k}$ as either $\phi(\rho,\sigma)$ or $\chi(\rho,\sigma)$. The inner inequalities in statement (i) of this
proposition also follow immediately from the labeling assumption (3). Direct manipulation shows that the outer inequalities
$\phi(\rho,\sigma) \le \frac{t_{i+1}^*}{p_{i+1}}$ and $\frac{t_{j-1}^*}{p_{j-1}} \le \chi(\rho,\sigma)$ are equivalent to
$\rho \le \rho_{i+1}$ and $\sigma \le \sigma_{j-1}$, respectively. This proves (i).
Next, we proceed to demonstrate the strict monotonicity of $\phi$. A simple calculation shows that

$$\frac{\partial \phi}{\partial \rho} = \frac{\bar{Q}_{i+1} - \sigma}{(1 + \rho - \sigma)^2 P_i}.$$

To prove that $\frac{\partial \phi}{\partial \rho} > 0$, it is sufficient to verify that $\bar{Q}_j > \sigma_{j-1}$, or equivalently,
that $\bar{P}_j \frac{q_{j-1}}{p_{j-1}} > 0$. Then, by the positivity assumption (2), we immediately see that this latter inequality
holds for any $j = 2, \ldots, n$. The strict monotonicity of $\phi$ in $\sigma$ also follows from assumption (2).

To complete (ii), we write the condition $\phi(\rho,\sigma) \le 1$ as

$$\rho \le \frac{(1 - \sigma) P_i - Q_i}{\bar{P}_{i+1}}.$$

A routine computation shows that the equality holds for $\rho_j(\sigma)$ and any $\sigma \in [\sigma_j, \sigma_{j-1}]$ with
$j = 2, \ldots, n$. Therefore, for any fixed $\sigma$, the inequality holds strictly for any other $\rho$. The converse, that is,
$\phi(\rho,\sigma) = 1$ implies $(\rho,\sigma) = (\rho_j(\sigma), \sigma)$, is immediate from the strict monotonicity of $\phi$. The
proof of statement (iii) proceeds along the same lines as that of (ii) and is omitted. ■
Our previous result tells us how perturbation operates. According to Proposition 5, the optimal strategies perturb the user
profile in such a manner that, in those categories with the lowest and highest ratios $\frac{q_k}{p_k}$, the apparent profile becomes
proportional to the population's distribution. More precisely, the common ratio $\frac{t_k^*}{p_k}$ increases with both $\rho$ and
$\sigma$ in those categories affected by forgery, that is, $k = 1, \ldots, i$. Exactly the opposite happens in those categories
affected by suppression, where the common ratio $\frac{t_k^*}{p_k}$ decreases with both rates. This tendency continues until
$\rho = \rho_{\mathrm{crit}}(\sigma)$, at which point $t^* = p$. Fig. 3 illustrates this proportionality property in the case of the
example depicted in Fig. 2.
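
The behavior of the two common ratios can be traced numerically. The following sketch is illustrative only: it assumes NumPy, the helper `apparent_ratios` is a hypothetical name, and the distributions are made-up values rather than those of Figs. 2 and 3. For a fixed suppression rate it sweeps the forgery rate up to the critical threshold and records $\phi$ (first category) and $\chi$ (last category).

```python
import numpy as np

def apparent_ratios(q, p, rho, sigma):
    """Ratios t*_k / p_k from the closed-form strategies (sketch, sorted q_k/p_k)."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    Q, P = np.cumsum(q), np.cumsum(p)
    Qb, Pb = np.cumsum(q[::-1])[::-1], np.cumsum(p[::-1])[::-1]
    i = max(1, int(np.sum(P * q / p - Q < rho)))                      # 1-based i
    j = len(q) + 1 - max(1, int(np.sum(Qb - Pb * q / p < sigma)))     # 1-based j
    c = (Q[i - 1] + rho) / P[i - 1]
    d = (Qb[j - 1] - sigma) / Pb[j - 1]
    r = np.maximum(0.0, c * p - q)       # optimal forgery strategy
    s = np.maximum(0.0, q - d * p)       # optimal suppression strategy
    t = (q + r - s) / (1 + rho - sigma)  # optimal apparent profile
    return t / p

# Hypothetical example; for sigma = 0.20 the critical forgery rate is 0.325.
p = [0.2, 0.2, 0.2, 0.2, 0.2]
q = [0.05, 0.10, 0.20, 0.25, 0.40]
sigma = 0.20
rhos = [0.0, 0.10, 0.20, 0.30, 0.325]

phi = np.array([apparent_ratios(q, p, rho, sigma)[0] for rho in rhos])
chi = np.array([apparent_ratios(q, p, rho, sigma)[-1] for rho in rhos])
for rho, a, b in zip(rhos, phi, chi):
    print(f"rho={rho:.3f}  phi={a:.4f}  chi={b:.4f}")
```

In this run $\phi$ grows and $\chi$ shrinks as $\rho$ increases, and both reach 1 at the critical rate, in line with statements (ii) and (iii).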



C. Critical-Privacy Region 

One of the results of Theorem 3 is that the boundary of the critical-privacy region is determined by the critical
forgery-suppression threshold $\rho_j(\sigma)$, which we also denote by $\rho_{\mathrm{crit}}(\sigma)$ to highlight this fact. The
following proposition builds on this result and characterizes said region. In particular, Proposition 6 first examines some
properties of this threshold and then investigates the convexity of the critical-privacy region.
Proposition 6 (Convexity of the Critical-Privacy Region):

(i) $\rho_j$ is a convex, piecewise linear function of $\sigma \in [\sigma_j, \sigma_{j-1}]$ for $j = 2, \ldots, n$.

(ii) $\mathcal{C}$ is convex.

Proof: From Theorem 3, it is routine to check the continuity of $\rho_j$ on $[\sigma_n, \sigma_1]$. To show its convexity, we
conveniently write this function as $\rho_j(\sigma) = m_j \sigma + b_j$, where $m_j = -\frac{P_{j-1}}{\bar{P}_j}$ and
$b_j = \frac{P_{j-1}}{\bar{P}_j} \bar{Q}_j - Q_{j-1}$. Next, we prove that the slopes satisfy $m_j < m_{j-1}$ for all
$j = 3, \ldots, n$. We proceed by contradiction, assuming that $m_j \ge m_{j-1}$. Note that this inequality is equivalent
to $P_{j-1} \bar{P}_{j-1} \le \bar{P}_j - \bar{P}_j \bar{P}_{j-1}$ and, after algebraic simplification, to $p_{j-1} \le 0$. This
contradicts the positivity assumption (2), which, in turn, implies that $m_j < 0$ for all $j = 2, \ldots, n$. Therefore, since
$\rho_j$ is a piecewise linear function defined by the strictly increasing sequence of negative slopes $\{m_n, \ldots, m_2\}$, we can
conclude that $\rho_j$ is convex. This proves statement (i).
The second statement follows from the first one. As $\rho_j$ is convex, so is its epigraph, i.e., the critical-privacy region. ■
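
The slope argument in this proof can be illustrated with a few lines of code. The sketch below is a hedged example, assuming NumPy; the names `critical_pieces` and `rho_crit` and the five-category distributions are hypothetical. It computes the slopes $m_j$ and intercepts $b_j$, evaluates $\rho_{\mathrm{crit}}(\sigma)$ by locating the interval $[\sigma_j, \sigma_{j-1}]$, and compares the result with the pointwise maximum of the linear pieces, which coincides with a continuous piecewise linear function exactly when that function is convex.

```python
import numpy as np

def critical_pieces(q, p):
    """Slopes m_j, intercepts b_j (j = 2..n) and suppression thresholds sigma_j."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    Q, P = np.cumsum(q), np.cumsum(p)
    Qb, Pb = np.cumsum(q[::-1])[::-1], np.cumsum(p[::-1])[::-1]
    m = -P[:-1] / Pb[1:]                      # m_j = -P_{j-1} / bar P_j
    b = P[:-1] / Pb[1:] * Qb[1:] - Q[:-1]     # b_j = (P_{j-1}/bar P_j) bar Q_j - Q_{j-1}
    sig_thr = Qb - Pb * q / p                 # sigma_1, ..., sigma_n
    return m, b, sig_thr

def rho_crit(sigma, m, b, sig_thr):
    """Evaluate rho_crit(sigma) on the piece j with sigma in [sigma_j, sigma_{j-1}]."""
    n = len(sig_thr)
    j = n + 1 - max(1, int(np.sum(sig_thr < sigma)))
    j = max(2, j)                             # pieces are defined for j = 2..n
    return m[j - 2] * sigma + b[j - 2]

# Hypothetical example distributions (not those of Fig. 4).
p = [0.2, 0.2, 0.2, 0.2, 0.2]
q = [0.05, 0.10, 0.20, 0.25, 0.40]
m, b, sig_thr = critical_pieces(q, p)

grid = np.linspace(0.0, sig_thr[0], 61)       # sigma in [sigma_n, sigma_1]
by_interval = np.array([rho_crit(s, m, b, sig_thr) for s in grid])
by_maximum = np.array([np.max(m * s + b) for s in grid])
print(np.round(m, 4))                          # m_2, ..., m_n: negative, decreasing in j
print(np.allclose(by_interval, by_maximum))
```

The region above this curve, $\{(\rho,\sigma) : \rho \ge \rho_{\mathrm{crit}}(\sigma)\}$, is the epigraph referred to in statement (ii).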

The conclusions drawn from Proposition 6 are illustrated in Fig. 4. In this figure we represent the critical and noncritical-privacy
regions for $n = 5$ categories of interest; the distributions $q$ and $p$ assumed in this conceptual example are different
from those considered in Figs. 2 and 3. That said, the figure in question shows a straightforward consequence of our previous
proposition: the noncritical-privacy region is nonconvex.

Fig. 4: Conceptual plot of the critical and noncritical privacy regions for n = 5 categories.

In this illustrative example, the sequences of forgery thresholds $\{\rho_1, \ldots, \rho_5\}$ and suppression thresholds
$\{\sigma_5, \ldots, \sigma_1\}$ are strictly increasing. By Proposition 2, we can conclude then that the inequalities of the labeling
assumption (3) hold strictly. Related to these thresholds is also the number of nonzero components of the optimal strategies, as
follows from Theorem 3. Fig. 5 shows the sets of pairs $(\rho,\sigma)$ where the number of nonzero components of $r^*$ and $s^*$ is
fixed. Thus, in the triangular area shown darker, corresponding to the Cartesian product of the intervals $[\rho_3, \rho_4]$ and
$[\sigma_4, \sigma_3]$, the solutions $r^*$ and $s^*$ have $i = 3$ and $n - j + 1 = 2$ nonzero components, respectively.



D. Case of Low Forgery and Suppression 

This subsection characterizes the privacy-forgery-suppression function in the special case when $\rho, \sigma \simeq 0$.

Proposition 7 (Low Rates of Forgery and Suppression): Assume the nontrivial case in which $q \ne p$. Then, there exist two
indexes $i, j$ such that $0 = \rho_1 = \cdots = \rho_i < \rho_{i+1}$ and $0 = \sigma_n = \cdots = \sigma_j < \sigma_{j-1}$. For any
$\rho \in [0, \rho_{i+1}]$ and $\sigma \in [0, \sigma_{j-1}]$, the number of nonzero components of the optimal forgery and suppression
strategies is $i$ and $n - j + 1$, respectively. Further, the gradient of the privacy-forgery-suppression function at the origin is

$$\nabla \mathcal{R}(0,0) = \left(\log \frac{q_1}{p_1} - D(q \,\|\, p),\ \ D(q \,\|\, p) - \log \frac{q_n}{p_n}\right).$$



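Proof: The existence of the indexes $i$ and $j$ is guaranteed by the assumption that $q \neq p$. The number of nonzero components
of $r^*$ and $s^*$ is trivial from Theorem 3. In view of this theorem, for any $\rho \in [0, \rho_{i+1}]$ and $\sigma \in [0, \sigma_{j-1}]$, we have

\[
\mathcal{R}(\rho, \sigma) = D\!\left( \frac{q + \rho\,(1, 0, \ldots, 0) - \sigma\,(0, \ldots, 0, 1)}{1 + \rho - \sigma} \,\middle\|\, p \right).
\]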

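The continuity of the components of $r^*$ and $s^*$ proven in Corollary 4 (ii) ensures the continuity of the privacy-forgery-suppression
function for these rates. It is routine to check its differentiability in this region and to obtain its derivative with respect to
$\sigma$ at the origin,

\[
\frac{\partial \mathcal{R}(0,0)}{\partial \sigma}
= Q_i \log\!\left( \frac{Q_i}{P_i}\,\frac{\bar{P}_j}{\bar{Q}_j} \right)
+ \sum_{k=i+1}^{j-1} q_k \log\!\left( \frac{\bar{P}_j}{\bar{Q}_j}\,\frac{q_k}{p_k} \right),
\]

where $Q_i$ and $P_i$ denote the sums of the first $i$ components of $q$ and $p$, and $\bar{Q}_j$ and $\bar{P}_j$ the sums of their components $j, \ldots, n$.
On account of Proposition 2, the conditions $\rho_1 = \cdots = \rho_i = 0$ and $\sigma_j = \cdots = \sigma_n = 0$ imply

\[
\frac{q_1}{p_1} = \cdots = \frac{q_i}{p_i} = \frac{Q_i}{P_i}
\qquad \text{and} \qquad
\frac{\bar{Q}_j}{\bar{P}_j} = \frac{q_j}{p_j} = \cdots = \frac{q_n}{p_n}.
\]

Therefore,

\[
\frac{\partial \mathcal{R}(0,0)}{\partial \sigma}
= \sum_{k=1}^{j-1} q_k \log\frac{q_k}{p_k} - Q_{j-1} \log\frac{q_n}{p_n}
= D(q\,\|\,p) - \log\frac{q_n}{p_n}.
\]

The derivative of $\mathcal{R}$ with respect to $\rho$ at $\rho = \sigma = 0$ follows analogously. ■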
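Next, we shall derive an expression for the relative decrement of the privacy-risk function at $\rho, \sigma \approx 0$. To this end, define
the forgery relative decrement factor

\[
\delta_\rho = -\frac{\partial \mathcal{R}(0,0) / \partial \rho}{\mathcal{R}(0,0)} = 1 - \frac{\log\frac{q_1}{p_1}}{D(q\,\|\,p)},
\]

and the suppression relative decrement factor

\[
\delta_\sigma = -\frac{\partial \mathcal{R}(0,0) / \partial \sigma}{\mathcal{R}(0,0)} = \frac{\log\frac{q_n}{p_n}}{D(q\,\|\,p)} - 1.
\]

By dint of Proposition 7, the first-order Taylor approximation of the privacy-forgery-suppression function around $\rho = \sigma = 0$ yields

\[
\mathcal{R}(\rho, \sigma) \approx D(q\,\|\,p) + \rho \left( \log\frac{q_1}{p_1} - D(q\,\|\,p) \right) + \sigma \left( D(q\,\|\,p) - \log\frac{q_n}{p_n} \right),
\]

or more compactly, in terms of the decrement factors,

\[
\frac{D(q\,\|\,p) - \mathcal{R}(\rho, \sigma)}{D(q\,\|\,p)} \approx \delta_\rho\, \rho + \delta_\sigma\, \sigma.
\]

In words, the minimum and maximum ratios $q_k / p_k$ characterize the relative reduction in privacy risk. The following result,
Proposition 8, establishes a bound on these relative decrement factors.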

Proposition 8 (Relative Decrement Factors): In the nontrivial case when $q \neq p$, the relative decrement factors satisfy $\delta_\rho > 1$
and $\delta_\sigma > 0$.

Proof: Observe that the statement $\delta_\rho > 1$ is equivalent to the condition $q_1 < p_1$. We prove this by contradiction. Suppose
that $q_1 > p_1$. By the labeling assumption (3), it follows that $q_k > p_k$ for all $k$, which leads to the contradiction that $1 = \sum_k q_k > \sum_k p_k = 1$.
Now assume that $q_1 = p_1$. Since $q \neq p$, there must exist an index $i$ such that

\[
\frac{q_1}{p_1} = \cdots = \frac{q_{i-1}}{p_{i-1}} < \frac{q_i}{p_i} \leqslant \cdots \leqslant \frac{q_n}{p_n}.
\]

But this implies that

\[
1 - \sum_{k=1}^{i-1} q_k = \sum_{k=i}^{n} q_k > \sum_{k=i}^{n} p_k = 1 - \sum_{k=1}^{i-1} p_k,
\]

a contradiction. This proves the first part of the proposition.

For the second part, note that the statement $\delta_\sigma > 0$ is equivalent to

\[
q_1 \log\frac{q_1}{p_1} + \cdots + q_n \log\frac{q_n}{p_n} < \log\frac{q_n}{p_n},
\]

and, after algebraic manipulation, to

\[
q_1 \log\frac{q_1\, p_n}{p_1\, q_n} + \cdots + q_{n-1} \log\frac{q_{n-1}\, p_n}{p_{n-1}\, q_n} < 0.
\]

The positivity and labeling assumptions (2), (3) ensure that all terms in the sum are nonpositive. However, the additional
assumption $q \neq p$ implies that $\frac{q_1}{p_1} < \frac{q_n}{p_n}$, which in turn implies that the first term is negative and so is, consequently, the entire
summation. ■
Conceptually, the bound on $\delta_\rho$ tells us that the relative decrement in privacy risk is greater than the forgery rate introduced.
This is under the assumption that $q \neq p$ and at low rates of forgery and suppression. The bound on $\delta_\sigma$, however, is looser than
the previous one and just ensures that an increase in the suppression rate always leads to a decrease in privacy risk, as one
would expect.
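
The decrement factors and the first-order approximation of this subsection lend themselves to a quick numerical check. The following Python sketch is illustrative only: the function names are ours (hypothetical), logarithms are assumed to be in base 2, and the profiles are assumed to be already sorted according to the labeling assumption. The distributions are those of the numerical example in Sec. IV-F.

```python
import math


def kl_divergence_bits(q, p):
    """Kullback-Leibler divergence D(q || p) in bits; terms with q_k = 0 contribute 0."""
    return sum(qk * math.log2(qk / pk) for qk, pk in zip(q, p) if qk > 0)


def decrement_factors(q, p):
    """Return (delta_rho, delta_sigma), assuming q_k / p_k is nondecreasing in k."""
    d = kl_divergence_bits(q, p)
    delta_rho = 1.0 - math.log2(q[0] / p[0]) / d
    delta_sigma = math.log2(q[-1] / p[-1]) / d - 1.0
    return delta_rho, delta_sigma


def relative_reduction_low_rates(q, p, rho, sigma):
    """First-order estimate of (D(q||p) - R(rho, sigma)) / D(q||p) near the origin."""
    delta_rho, delta_sigma = decrement_factors(q, p)
    return delta_rho * rho + delta_sigma * sigma


# Distributions of the numerical example (already sorted by q_k / p_k).
q = (0.130, 0.440, 0.430)
p = (0.380, 0.390, 0.230)

delta_rho, delta_sigma = decrement_factors(q, p)
print("delta_rho   =", delta_rho)
print("delta_sigma =", delta_sigma)
print("estimated relative reduction at rho = sigma = 0.001:",
      relative_reduction_low_rates(q, p, 0.001, 0.001))
```

The estimate is only meaningful for rates close to zero, where the first-order approximation holds; the bounds of Proposition 8 can be checked directly on the two returned factors.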






E. Pure Strategies 

In the previous subsections we investigated the forgery and the suppression of ratings as a mixed strategy that users may
adopt to enhance their privacy. In this subsection we contemplate the case in which users may be reluctant to use these two
mechanisms in conjunction and, as a consequence, may opt for a pure strategy consisting of the application of either
forgery or suppression. In this case, it would be useful to determine which technique is the most appropriate in terms of the
privacy-utility trade-off posed. Our next result, Corollary 9, provides some insight into this, under the assumption that, from the
user's perspective, the impact on utility due to forgery is equivalent to that caused by suppression.

Before showing this result, observe from Theorem 5 that $\rho_n = \frac{q_n}{p_n} - 1$ is the minimum forgery rate such that $\mathcal{R}(\rho, 0) = 0$.
Analogously, $\sigma_1 = 1 - \frac{q_1}{p_1}$ is the minimum suppression rate satisfying $\mathcal{R}(0, \sigma) = 0$. In other words, $\rho_n$ and $\sigma_1$ are the critical
rates of the pure forgery and suppression strategies, respectively. Further, note that $\sigma_1 < \sigma_0 = 1$, on account of the positivity
assumption (2). However, $\rho_n > 1$ if, and only if, $\frac{q_n}{p_n} > 2$.

Fig. 5: Contour lines of the privacy-forgery-suppression function, the corresponding forgery and suppression thresholds, and the critical and noncritical privacy
regions.

Corollary 9 (Pure Strategies): Consider the nontrivial case when $q \neq p$.

(i) The critical rates of the pure forgery and suppression strategies satisfy $\rho_n < \sigma_1$ if, and only if,

\[
\frac{q_1 / p_1 + q_n / p_n}{2} < 1.
\]

(ii) The forgery and the suppression relative decrement factors satisfy $\delta_\rho > \delta_\sigma$ if, and only if,

\[
\sqrt{\frac{q_1}{p_1}\,\frac{q_n}{p_n}} < 2^{D(q\,\|\,p)}.
\]

Proof: Both statements are immediate from the definitions of $\rho_n$ and $\sigma_1$ on the one hand, and $\delta_\rho$ and $\delta_\sigma$ on the other. ■
In conceptual terms, the condition $\rho_n < \sigma_1$ means that the pure forgery strategy is the most appropriate mechanism in terms
of causing the minimum distortion to attain the critical-privacy region. On the other hand, the condition $\delta_\rho > \delta_\sigma$ implies that,
at low rates, the pure forgery strategy offers better privacy protection than the pure suppression strategy does. Therefore, the
conclusion that follows from Corollary 9 is that, together with the quantity $D(q\,\|\,p)$, the arithmetic and geometric means of the
ratios $\frac{q_1}{p_1}$ and $\frac{q_n}{p_n}$ determine which strategy to choose.

Another interesting remark is the duality of these two ratios, $\frac{q_1}{p_1}$ and $\frac{q_n}{p_n}$. The former characterizes the minimum rate for the
pure suppression strategy to reach the critical-privacy region and, at the same time, it establishes the privacy gain at low forgery
rates. Conversely, the latter ratio defines the critical rate of the pure forgery strategy and determines the relative decrement in
privacy risk at low suppression rates.
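
The two criteria of Corollary 9 can be evaluated in a few lines. The sketch below is a hedged illustration rather than the authors' implementation: the function and key names are hypothetical, logarithms are assumed to be in base 2, and the profiles are assumed to be sorted as required by the labeling assumption. The distributions are those of the numerical example in Sec. IV-F.

```python
import math


def kl_divergence_bits(q, p):
    """Kullback-Leibler divergence D(q || p) in bits."""
    return sum(qk * math.log2(qk / pk) for qk, pk in zip(q, p) if qk > 0)


def pure_strategy_summary(q, p):
    """Evaluate both criteria of Corollary 9 for profiles sorted by q_k / p_k."""
    ratio_min = q[0] / p[0]    # q_1 / p_1
    ratio_max = q[-1] / p[-1]  # q_n / p_n
    arithmetic_mean = (ratio_min + ratio_max) / 2.0
    geometric_mean = math.sqrt(ratio_min * ratio_max)
    return {
        "rho_n": ratio_max - 1.0,    # critical rate of pure forgery
        "sigma_1": 1.0 - ratio_min,  # critical rate of pure suppression
        # (i) rho_n < sigma_1  <=>  arithmetic mean of the two ratios < 1
        "forgery_reaches_critical_region_first": arithmetic_mean < 1.0,
        # (ii) delta_rho > delta_sigma  <=>  geometric mean < 2 ** D(q || p)
        "forgery_better_at_low_rates": geometric_mean < 2.0 ** kl_divergence_bits(q, p),
    }


q = (0.130, 0.440, 0.430)
p = (0.380, 0.390, 0.230)
summary = pure_strategy_summary(q, p)
for name, value in summary.items():
    print(name, "=", value)
```

For these distributions the two criteria disagree: suppression reaches the critical-privacy region with less distortion, whereas forgery yields the larger reduction in privacy risk at low rates.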

Lastly, we would like to establish a connection between our work and that of [11], [20], where the pure forgery and
suppression strategies are investigated. Denote by $\mathcal{R}_{\mathrm{F}}$ the function derived in [11] modeling the trade-off between forgery rate
and privacy risk, the latter being measured as the KL divergence between the user's apparent profile and the population's
distribution. Define $\rho'$ as the ratio of forged ratings to total number of ratings. Accordingly, it can be shown that $\rho' = \frac{\rho}{1 + \rho}$ and
that $\mathcal{R}(\rho, 0) = \mathcal{R}_{\mathrm{F}}(\rho')$. On the other hand, denote by $\mathcal{P}_{\mathrm{S}}$ the function in [20] characterizing the trade-off between suppression
rate and privacy gain. In this case, privacy is measured as the Shannon entropy of the user's apparent profile. Under the
assumption that the population's profile is uniform, it can be proven that $\mathcal{R}(0, \sigma) = \log n - \mathcal{P}_{\mathrm{S}}(\sigma)$. In short, our formulation
of the problem of optimal forgery and suppression of ratings encompasses, as particular cases, the cited works.

F. Numerical Example 

This subsection presents a numerical example that illustrates the theoretical analysis conducted in the previous subsections. 
Later on, in Sec. V, we shall evaluate the effectiveness of our approach in a real scenario, namely in the movie recommendation
system Movielens. In our numerical example we assume $n = 3$ categories of interest. Although the example shown here
is synthetic, these three categories could very well represent interests across topics such as technology, sports and beauty.
Accordingly, we suppose that the user's rating distribution is



\[ q = (0.130, 0.440, 0.430), \]

and the population's,

\[ p = (0.380, 0.390, 0.230). \]

Note that these distributions satisfy the positivity and labeling assumptions (2), (3).

Fig. 6: Probability simplices showing, for several interesting values of $\rho$ and $\sigma$, the user's actual profile $q = (0.130, 0.440, 0.430)$, the population's distribution
$p = (0.380, 0.390, 0.230)$, the optimal apparent distribution $t^*$ and the set of feasible apparent distributions. The vertices of each simplex are labeled (100), (010) and (001).
(a) $\rho = 0.050$, $\sigma = 0.100$, $\rho/\rho_{\mathrm{crit}}(\sigma) \approx 0.093$, $\mathcal{R}(\rho,\sigma) \approx 0.131$, $\mathcal{R}(\rho,\sigma)/\mathcal{R}(0,0) \approx 0.498$, $r^* = (0.050, 0, 0)$, $s^* = (0, 0, 0.100)$, $t^* \approx (0.189, 0.463, 0.347)$.
(b) $\rho = 0.100$, $\sigma = 0.200$, $\rho/\rho_{\mathrm{crit}}(\sigma) \approx 0.356$, $\mathcal{R}(\rho,\sigma) \approx 0.050$, $\mathcal{R}(\rho,\sigma)/\mathcal{R}(0,0) \approx 0.190$, $r^* = (0.100, 0, 0)$, $s^* \approx (0, 0.019, 0.181)$, $t^* \approx (0.256, 0.468, 0.276)$.
(c) $\rho \approx 0.219$, $\sigma = 0.300$, $\rho/\rho_{\mathrm{crit}}(\sigma) = 1$, $\mathcal{R}(\rho,\sigma) = 0$, $\mathcal{R}(\rho,\sigma)/\mathcal{R}(0,0) = 0$, $r^* \approx (0.219, 0, 0)$, $s^* \approx (0, 0.081, 0.219)$, $t^* = p$.
(d) $\rho = 0.300$, $\sigma = 0.300$, $\rho/\rho_{\mathrm{crit}}(\sigma) \approx 1.368$, $\mathcal{R}(\rho,\sigma) = 0$, $\mathcal{R}(\rho,\sigma)/\mathcal{R}(0,0) = 0$, $r^* \approx (0.260, 0.021, 0.019)$, $s^* = (0.010, 0.071, 0.219)$, $t^* = p$.

From Sec. IV-A we easily obtain the forgery thresholds $\rho_1 = 0$, $\rho_2 \approx 0.299$ and $\rho_3 \approx 0.870$ on the one hand, and on the
other the suppression thresholds $\sigma_3 = 0$, $\sigma_2 \approx 0.171$ and $\sigma_1 \approx 0.658$. The thresholds $\rho_3$ and $\sigma_1$ are the critical rates of the pure
strategies. If we are to reach the critical-privacy region and do not have any preference for either forgery or suppression, the
fact that $\rho_3 > \sigma_1$ leads us to opt for suppression as pure strategy. However, the geometric mean of $\frac{q_1}{p_1}$ and $\frac{q_3}{p_3}$ is approximately
0.799, which is lower than $2^{D(q\,\|\,p)} \approx 1.20$. On account of Corollary 9, this means that the pure forgery strategy contributes to
a greater reduction in privacy risk at low rates than suppression does. In fact, the gradient of the privacy-forgery-suppression
function at the origin is $\nabla \mathcal{R}(0,0)^{\mathsf{T}} \approx (-1.81, -0.639)$, by virtue of Proposition 7.
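
The threshold values quoted above can be reproduced with a short computation. The closed-form expressions used below are an assumption on our part, chosen because they match the reported numbers: $\rho_i = \frac{q_i}{p_i} P_i - Q_i$ and $\sigma_j = \bar{Q}_j - \frac{q_j}{p_j} \bar{P}_j$, with $P_i$, $Q_i$ the sums of the first $i$ components and $\bar{P}_j$, $\bar{Q}_j$ the sums of components $j, \ldots, n$. The authoritative definitions are those of Sec. IV-A; the function names are hypothetical.

```python
from itertools import accumulate


def forgery_thresholds(q, p):
    """Assumed form rho_i = (q_i / p_i) * P_i - Q_i, for profiles sorted by q_k / p_k."""
    head_q = list(accumulate(q))  # Q_1, ..., Q_n
    head_p = list(accumulate(p))  # P_1, ..., P_n
    return [(qi / pi) * Pi - Qi for qi, pi, Qi, Pi in zip(q, p, head_q, head_p)]


def suppression_thresholds(q, p):
    """Assumed form sigma_j = Qbar_j - (q_j / p_j) * Pbar_j, with tail sums over j..n."""
    tail_q = list(accumulate(reversed(q)))[::-1]  # Qbar_1, ..., Qbar_n
    tail_p = list(accumulate(reversed(p)))[::-1]  # Pbar_1, ..., Pbar_n
    return [Qb - (qj / pj) * Pb for qj, pj, Qb, Pb in zip(q, p, tail_q, tail_p)]


q = (0.130, 0.440, 0.430)
p = (0.380, 0.390, 0.230)
rho = forgery_thresholds(q, p)        # rho_1, rho_2, rho_3
sigma = suppression_thresholds(q, p)  # sigma_1, sigma_2, sigma_3
print("forgery thresholds:    ", rho)
print("suppression thresholds:", sigma)
```

Under these assumed expressions, the last forgery threshold and the first suppression threshold coincide with the critical rates of the pure strategies, $\frac{q_n}{p_n} - 1$ and $1 - \frac{q_1}{p_1}$.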

Fig. 5 shows the contour lines of this function, computed analytically from Theorem 3 and numerically (b). The region
plotted in gray shades corresponds to the noncritical-privacy region. The initial privacy risk is $\mathcal{R}(0, 0) \approx 0.263$. The white
area represents the critical-privacy region, where the apparent user profile coincides with the population's distribution and
thus the privacy risk vanishes. An interesting observation arising from Fig. 5 is the synergistic effect of combining forgery and
suppression. Just as an example, in the case when $\rho = \rho_2$ and $\sigma = \sigma_2$, the sum of these two distortion measures is lower than
the critical rates of the pure strategies.
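
The boundary between the two regions can also be traced numerically. The sketch below computes the critical forgery rate for a given suppression rate under one assumption of ours: the critical-privacy region is reached exactly when the apparent profile $\frac{q + r - s}{1 + \rho - \sigma}$ can be made equal to $p$ with nonnegative $r$ and $s$. The function name is hypothetical and the routine is not taken from the paper.

```python
def critical_forgery_rate(q, p, sigma):
    """Smallest rho for which (q + r - s) / (1 + rho - sigma) can equal p (assumed criterion)."""
    order = sorted(range(len(q)), key=lambda k: q[k] / p[k])
    ratios = [q[k] / p[k] for k in order]
    q_tail = p_tail = 0.0
    # Add categories from the highest ratio downwards until the scale factor
    # c solving sum_{tail} (q_k - c * p_k) = sigma is consistent with the tail used.
    for pos in range(len(order) - 1, -1, -1):
        k = order[pos]
        q_tail += q[k]
        p_tail += p[k]
        scale = (q_tail - sigma) / p_tail
        if pos == 0 or scale >= ratios[pos - 1]:
            break
    # The scale factor equals 1 + rho - sigma on the boundary.
    return max(0.0, scale - 1.0 + sigma)


q = (0.130, 0.440, 0.430)
p = (0.380, 0.390, 0.230)
for sigma in (0.0, 0.1, 0.2, 0.3):
    print("sigma =", sigma, "-> critical forgery rate =", critical_forgery_rate(q, p, sigma))
```

Evaluated at $\sigma = \sigma_2 \approx 0.171$, this routine returns approximately $\rho_2 \approx 0.299$, so the pair $(\rho_2, \sigma_2)$ sits on the boundary with a total distortion of about 0.47, below both pure critical rates.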

Next, we examine the optimal apparent rating distribution for different values of $\rho$ and $\sigma$. For this purpose, the user's genuine
distribution $q$, the population's distribution $p$ and the optimal apparent distribution $t^*$ are depicted in the probability simplices
shown in Fig. 6. In each simplex, we also represent the contour lines of the KL divergence $D(\cdot\,\|\,p)$ between every distribution
in the simplex and $p$. Further, we plot the set of feasible apparent user distributions, not necessarily optimal, for four different
combinations of $\rho$ and $\sigma$; in any of these cases, the set takes the form of a hexagon. Having said this, we now turn our attention
to Fig. 6(a). In this case, the optimal forgery and suppression strategies have $i = n - j + 1 = 1$ nonzero component, since
$\rho \in [0, \rho_2]$ and $\sigma \in [0, \sigma_2]$. This places the solution $t^*$ at one vertex of the hexagon. A remarkable fact is that, for these rates,
the privacy risk is approximately halved. In the end, consistently with Proposition 8, the forgery and the suppression relative
decrement factors are $\delta_\rho \approx 6.87 > 1$ and $\delta_\sigma \approx 2.42 > 0$.

(b) The numerical method chosen is the interior-point algorithm [41] implemented by the Matlab R2012b function fmincon.

TABLE I: Category index of the particular user examined in our experiments. The categories of Movielens have been sorted and indexed in order to satisfy
the labeling assumption (3).

Index  Category name    Index  Category name    Index  Category name
  1    animation          7    sci-fi            13    war
  2    action             8    comedy            14    mystery
  3    film-noir          9    thriller          15    musical
  4    children's        10    fantasy           16    romance
  5    adventure         11    horror            17    IMAX
  6    crime             12    western           18    drama
                                                 19    documentary



In the case shown in Fig. 6(b), $r^*$ still has $i = 1$ nonzero component, while $s^*$ contains $n - j + 1 = 2$ nonzero components.
Geometrically, the optimal apparent distribution lies at one edge of the feasible region. This lowers privacy risk to 19% of
its initial value. The case in which $(\rho, \sigma) = (\rho_{\mathrm{crit}}(\sigma), \sigma)$ is depicted in Fig. 6(c). Here, the number of nonzero components of
$r^*$ and $s^*$ remains the same as in the previous case, but the privacy risk becomes zero. The last case, illustrated in Fig. 6(d),
does not have any practical application, as $\mathcal{R}(\rho, \sigma) = 0$ for any $(\rho, \sigma)$ in the critical-privacy region. In this figure we can observe that the solution $t^*$
is placed in the interior of the hexagon, and that the orthogonality principle of the strategies $r^*$ and $s^*$ stated in Corollary 4
is not satisfied.
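
The cases of Fig. 6(a) and 6(b) can be reproduced with a two-sided "water-filling" rule: the smallest ratios $q_k / p_k$ are raised to a common level by forgery and the largest ones are lowered to a common level by suppression. The Python sketch below implements this rule as an assumption of ours that matches the values reported in the figure; it is not a transcription of Theorem 3, and all function names are hypothetical. In the critical-privacy region it simply returns $t^* = p$ without attempting to construct the strategies.

```python
import math


def kl_divergence_bits(t, p):
    """Kullback-Leibler divergence D(t || p) in bits."""
    return sum(tk * math.log2(tk / pk) for tk, pk in zip(t, p) if tk > 0)


def optimal_strategies(q, p, rho, sigma):
    """Two-sided water-filling sketch; returns (r, s, t) or (None, None, p) if critical."""
    n = len(q)
    order = sorted(range(n), key=lambda k: q[k] / p[k])
    ratios = [q[k] / p[k] for k in order]

    # Lower level a: raise the smallest ratios until sum_k max(0, a*p_k - q_k) = rho.
    q_sum = p_sum = 0.0
    for i, k in enumerate(order):
        q_sum += q[k]
        p_sum += p[k]
        a = (q_sum + rho) / p_sum
        if i == n - 1 or a <= ratios[i + 1]:
            break

    # Upper level b: lower the largest ratios until sum_k max(0, q_k - b*p_k) = sigma.
    q_sum = p_sum = 0.0
    for j in range(n - 1, -1, -1):
        k = order[j]
        q_sum += q[k]
        p_sum += p[k]
        b = (q_sum - sigma) / p_sum
        if j == 0 or b >= ratios[j - 1]:
            break

    if a >= b:  # the two levels meet: the apparent profile can be made equal to p
        return None, None, list(p)

    scale = 1.0 + rho - sigma
    r = [max(0.0, a * pk - qk) for qk, pk in zip(q, p)]
    s = [max(0.0, qk - b * pk) for qk, pk in zip(q, p)]
    t = [(qk + rk - sk) / scale for qk, rk, sk in zip(q, r, s)]
    return r, s, t


q = (0.130, 0.440, 0.430)
p = (0.380, 0.390, 0.230)
for rho, sigma in ((0.05, 0.10), (0.10, 0.20)):
    r, s, t = optimal_strategies(q, p, rho, sigma)
    print((rho, sigma), "r* =", r, "s* =", s, "t* =", t,
          "risk =", kl_divergence_bits(t, p))
```

The forgery and suppression vectors returned by this rule never act on the same category, which is consistent with the orthogonality of the optimal strategies in the noncritical-privacy region.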



V. Experimental Evaluation 

In this section we evaluate the extent to which the forgery and the suppression of ratings could enhance user privacy in a
real-world recommendation system. The system chosen to conduct this evaluation is Movielens, a popular movie recommender
developed by the GroupLens Research Lab [42] at the University of Minnesota. Like many other recommenders, Movielens allows
users to both rate and tag movies according to their preferences. These preferences are then exploited by the recommender to
suggest movies that users have not watched yet.



A. Data set 

The data set that we used to assess our data-perturbative mechanism is the Movielens 10M data set [43], which contains
10 000 054 ratings and 95 580 tags. The ratings and tags included in this data set were assigned to 10 681 movies by 71 567
users. The data are organized in the form of quadruples (username, movie, rating, time), each one representing the action of
a user rating a movie at a certain time. Usernames have been replaced with numbers in an attempt to anonymize the data set.

For our purposes of experimentation, we just needed the data fields username and movie, together with the categories each
movie belongs to. Movielens contemplates $n = 19$ categories or movie genres, listed in alphabetical order as follows: action,
adventure, animation, children's, comedy, crime, documentary, drama, fantasy, film-noir, horror, IMAX, musical, mystery,
romance, sci-fi, thriller, war and western. As we shall see later in Sec. V-B, for each particular user, we shall have to rearrange
those categories in such a way that the labeling assumption (3) is satisfied.

In our data set, all users rated at least 20 movies. This was the minimum number of ratings for the recommender to start
working (c). After the elimination of those users who exclusively tagged movies, the total number of users was reduced to 69 878.
Despite the large number of users, we found that only 4099 satisfied the positivity assumption (2). Considering that this small
group of users represents just 5.8% of the total number of users, we can assume that the application of our technique will
have a negligible effect on the population's profile $p$, as supposed in Sec. III-D.
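
To make the preprocessing step concrete, the following Python sketch builds a category profile from a list of rated movies and checks whether all of its components are strictly positive. Several points are assumptions on our part and are not taken from the paper: a movie belonging to several genres contributes one count to each of them, the positivity assumption (2) is read as requiring every component of the profile to be strictly positive, and the toy data and all names are hypothetical.

```python
from collections import Counter


def category_profile(rated_movies, genres_of, categories):
    """Relative-frequency histogram of a user's ratings across the given categories."""
    counts = Counter()
    for movie in rated_movies:
        counts.update(genres_of[movie])  # assumed: one count per genre of the movie
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("the user has no ratings in the given categories")
    return [counts[c] / total for c in categories]


def satisfies_positivity(profile):
    """Assumed reading of the positivity assumption: every component is positive."""
    return all(x > 0 for x in profile)


# Hypothetical toy data, not taken from the Movielens data set.
categories = ["action", "comedy", "drama"]
genres_of = {"m1": ["action"], "m2": ["comedy", "drama"], "m3": ["drama"]}

profile_a = category_profile(["m1", "m2", "m3"], genres_of, categories)
profile_b = category_profile(["m2", "m3"], genres_of, categories)
print(profile_a, satisfies_positivity(profile_a))
print(profile_b, satisfies_positivity(profile_b))
```

Under this reading, a user who has never rated a movie in some category is excluded, which would explain why only a small fraction of the users pass the filter.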



B. Results 

In this subsection we examine how the forgery and the suppression of ratings may help users of Movielens to enhance their 
privacy. With this aim, first, we analyze the effect of the perturbation of ratings on the privacy protection of a particular user 
from our data set. Secondly, we consider the entire set of 4099 users and assess the relative reduction in privacy risk when 
these users apply the same forgery and suppression rates. Lastly, we investigate the forgery and the suppression strategies 
separately, and draw some conclusions about these two pure strategies. 

To conduct our first experiments, we choose a particular user from our data set (d). Before perturbing the movie rating history
of this user, it is necessary that the components of the user's profile $q$ and the population's distribution $p$ be rearranged to
satisfy the labeling assumption (3). Table I shows how movie categories have been sorted, and then indexed from 1 to $n$, to
fulfill the assumption above. We would like to note that the index provided in this table does not necessarily coincide with the
index of other users in our data set.
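
The rearrangement described above amounts to sorting the categories by the ratio $q_k / p_k$ in nondecreasing order. A minimal Python sketch follows; the function name and the three-category toy data are hypothetical and serve only as an illustration.

```python
def sort_by_ratio(categories, q, p):
    """Reorder categories, q and p so that q_k / p_k is nondecreasing in k."""
    order = sorted(range(len(categories)), key=lambda k: q[k] / p[k])
    return ([categories[k] for k in order],
            [q[k] for k in order],
            [p[k] for k in order])


# Hypothetical toy profiles, not taken from the Movielens data set.
categories = ["action", "comedy", "drama"]
q = [0.5, 0.2, 0.3]
p = [0.4, 0.4, 0.2]

sorted_categories, sorted_q, sorted_p = sort_by_ratio(categories, q, p)
for index, name in enumerate(sorted_categories, start=1):
    print(index, name)
```

Because the ordering depends on the user's own profile $q$, two users will in general obtain different indexes for the same category.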

(c) Nowadays, the algorithm implemented by Movielens requires only 15 ratings to start generating predictions. 
(d) The user considered in this first series of experiments is identified by the number 3301 in [43].
Fig. 7: In this figure we represent (a) the item distribution $q$ of a particular user as well as the population's item distribution $p$. In addition, we plot (b)
the optimal forgery strategy $r^*$ and (c) the optimal suppression strategy $s^*$ that the user in question should adopt when they specify $\sigma = 0.150$ and
$\rho = \rho_{\mathrm{crit}}(\sigma) \approx 0.180$.



Fig. 7(a) depicts the user profile and the population profile, the latter being computed by averaging across the 69 878 users.
From this figure we note that the user's interest far exceeds the population's in categories such as musical, romance, IMAX,
drama and documentary. More precisely, the ratios $\frac{q_k}{p_k}$ yield

\[
\left( \frac{q_k}{p_k} \right)_{k = 15, \ldots, 19} \approx (1.300, 1.306, 1.451, 1.728, 2.292).
\]

In this figure, we also observe that the user's interest and the population's in category 17 are nearly zero, namely $q_{17} \approx 0.0005$
and $p_{17} \approx 0.0003$.



On the other hand, Fig. 7(a) indicates that the user shows little interest, compared to the population's preferences, in categories
such as animation, action, film-noir or children's, to name just a few. Specifically, the six smallest ratios $\frac{q_k}{p_k}$ yield

\[
\left( \frac{q_k}{p_k} \right)_{k = 1, \ldots, 6} \approx (0.444, 0.599, 0.651, 0.691, 0.705, 0.714).
\]



Figs. 7(b) and 7(c) show the optimal forgery and suppression strategies that this particular user should apply, in the case when
$\sigma = 0.150$ and $\rho_{\mathrm{crit}}(\sigma) \approx 0.180$. The solutions plotted in these figures are consistent with our two previous observations: the
optimal forgery strategy recommends that the user submit false ratings to movies falling into the categories where the ratio $\frac{q_k}{p_k}$
is low; and the optimal suppression strategy suggests that the user refrain from rating movies belonging to categories where
the ratio $\frac{q_k}{p_k}$ is high. Just as an example, the fact that $s^*_{17} \approx 0.0001$ means that the user at hand should eliminate one in five
ratings to movies classified as IMAX.

The optimal trade-off surface among privacy, forgery rate and suppression rate is represented in Fig. 8. In this figure we plot
the contour levels of the function $\mathcal{R}(\rho, \sigma)$, which we computed theoretically. The initial privacy risk is $\mathcal{R}(0,0) \approx 0.101$ and
the arithmetic mean of the ratios $\frac{q_1}{p_1}$ and $\frac{q_n}{p_n}$ yields approximately 1.37. Since the mean is higher than 1, Corollary 9
tells us that the user should opt for suppression as pure strategy, in lieu of forgery. This is under the assumption that they wish
to achieve the minimum privacy risk and do not have any preference for any of the pure strategies. Nevertheless, the fact that
$\delta_\rho \approx 12.6 > \delta_\sigma \approx 10.9$ leads us to choose forgery as pure strategy for $\rho, \sigma \approx 0$. When both strategies are combined, note that
a forgery and suppression rate of just 0.1% leads to a relative reduction in privacy risk of 2.35%, on account of the first-order
Taylor approximation derived in Sec. IV-D.
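
As a quick check of the last figure, and assuming the rounded factors $\delta_\rho \approx 12.6$ and $\delta_\sigma \approx 10.9$ quoted for this user, the first-order approximation of Sec. IV-D with $\rho = \sigma = 0.001$ gives

\[
\frac{D(q\,\|\,p) - \mathcal{R}(\rho, \sigma)}{D(q\,\|\,p)} \approx \delta_\rho\, \rho + \delta_\sigma\, \sigma \approx 12.6 \times 0.001 + 10.9 \times 0.001 = 0.0235,
\]

that is, a relative reduction of about 2.35%.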
Fig. 8: Optimal trade-off surface among privacy, forgery rate and suppression rate for one particular user in our data set. The four points shown in this figure
correspond to the pairs of values $(\rho, \sigma)$ that we used to show the proportionality relationship between $t^*$ and $p$ in Fig. 9.

In Fig. 8 we have also plotted 4 points, which correspond to the following pairs of values $(\rho, \sigma)$: (0.03, 0.04), (0.06, 0.08),
(0.11, 0.12) and (0.18, 0.15). For each of these pairs, we have represented the quotient $\frac{t^*_k}{p_k}$ in Fig. 9. The aim is to show how
the optimal apparent profile becomes proportional to the population's distribution, as the user approaches the critical-privacy
region. Fig. 9(a) considers the first pair of values. Here, $\rho$ and $\sigma$ fall into the intervals $[\rho_6, \rho_7]$ and $[\sigma_{18}, \sigma_{17}]$, respectively.
Consistently with Proposition 5, we check that $\frac{t^*_1}{p_1} = \cdots = \frac{t^*_6}{p_6} \approx 0.756$ and that $\frac{t^*_{18}}{p_{18}} = \frac{t^*_{19}}{p_{19}} \approx 1.52$.

In Fig. 9(b) we double the rates of forgery and suppression. On the one hand, this leads to $\frac{t^*_1}{p_1} = \cdots = \frac{t^*_7}{p_7}$. On the other,
the fact that $\sigma \in [\sigma_{15}, \sigma_{14}]$ implies that $\frac{t^*_{15}}{p_{15}} = \cdots = \frac{t^*_{19}}{p_{19}}$. It is also interesting to note that, for these relatively small values of
$\rho$ and $\sigma$, the final privacy risk is 26% of the initial value $D(q\,\|\,p)$.

As $\rho$ and $\sigma$ increase, so does the function $\phi$. The contrary happens with the function $\chi$, which decreases with both rates.
In Fig. 9(c), for example, the proportionality relationship between $t^*$ and $p$ holds for all except 4 categories. The last pair
$(\rho, \sigma) \approx (0.18, 0.15)$ lies at the boundary of the critical-privacy region, as shown in Fig. 8. This implies that $\frac{t^*_k}{p_k} = 1$ and therefore that $\mathcal{R}(\rho, \sigma) = 0$,
as captured in Fig. 9(d).



Having examined the case of a specific user, in our next series of experiments we evaluate the privacy-protection level that
users can achieve if they are disposed to forge and eliminate a fraction of their ratings. For simplicity, we suppose that all
users satisfying the positivity assumption (2) apply a common forgery rate and a common suppression rate. Fig. 10 depicts
the contours of the 10th, 50th and 90th percentile surfaces of relative reduction in privacy risk, for different values of $\rho$ and $\sigma$.
Two conclusions can be drawn from this figure.

• First, for relatively small values of $\rho$ and $\sigma$ (lower than 15%), a vast majority of users lowered privacy risk significantly.
In quantitative terms, we observe in Fig. 10(a) that, for $\rho = \sigma = 0.05$, the 10th percentile of the relative reduction in privacy
risk is 52.4%; that is, 90% of the users adhering to our technique reduced their privacy risk by at least 52.4%. For those same
rates of forgery and suppression, the 50th and 90th percentiles are 73.9% and 94.8%. For higher rates, e.g., $\rho = \sigma = 0.15$,
Fig. 10(b) highlights that half of the users experienced a reduction in privacy risk less than or equal to 100%.

• Secondly, the three percentile surfaces exhibit a certain symmetry with respect to the line $\rho = \sigma$. If this symmetry were
exact, the exchange of the rates of forgery and suppression would not have any impact on the resulting privacy protection
achieved. However, this is not the case. For example, Fig. 10(a) shows a lower reduction in privacy risk for $\rho < \sigma$,
particularly accentuated when $\sigma \approx 0$. The reason for this may be found in the fact that, for most users, $\rho_n$ is greater than
$\sigma_1$. We shall elaborate more on this later on when we consider forgery and suppression as pure strategies.
Next, we analyze the privacy protection provided by our technique for $\rho, \sigma \approx 0$. In the theoretical analysis conducted in
Sec. IV-D, we derived an expression for the relative reduction in privacy risk at low rates. In particular, said expression was in
terms of two factors, namely $\delta_\rho$ and $\delta_\sigma$. In Fig. 11 we show the probability distribution of these factors. Consistently with
Proposition 8, the minimum values of these factors are $\delta_\rho \approx 3.12 > 1$ and $\delta_\sigma \approx 2.30 > 0$. The maximum values attained by
the forgery and the suppression factors are approximately 324.98 and 266.13. On the other hand, in favour of suppression is
the fact that the percentage of users with $\delta_\rho \geqslant 30$ is lower than that of users with $\delta_\sigma \geqslant 30$. More precisely, these percentages
are 26.8% and 33.1%, respectively. In the end, an eye-opening finding is that $\delta_\rho > \delta_\sigma$ in 43.45% of users, which suggests
introducing a suppression rate higher than that of forgery, at least at low rates.

After analyzing the forgery and the suppression of ratings as a mixed strategy, our last experimental results contemplate the
application of forgery and suppression as pure strategies. In Fig. 12 we illustrate the probability distribution of the critical rates
$\rho_n$ and $\sigma_1$. The critical forgery rate ranges approximately from 0.171 to 54.18, and its average is 3.45. The critical suppression
rate, on the other hand, goes from 0.153 to 0.963, and its average is 0.632. These figures indicate that, on average, a user will
have either to refrain from rating an item six out of ten times, or submit nearly 3.45 false ratings per original rating. This
is, of course, when the user wishes to reach the critical-privacy region. Bearing these figures in mind, it is not surprising then
that 95.3% of the users in our data set would opt for suppression as pure strategy, as it comes at the cost of a lower impact
on utility.

Fig. 9: Proportionality relationship between, on the one hand, the optimal apparent item distribution $t^*$ of the user identified as 3301 in our data set, and on
the other, the population's item distribution $p$.



VI. Conclusion 

In the literature of recommendation systems there exists a variety of approaches aimed at protecting user privacy. Among
these approaches, the forgery and the suppression of ratings emerge as a technique that may hinder attackers in their efforts
to accurately profile users on the basis of the items they rate. Our technique does not require that users trust either the
recommender or the network operator, it is simple in terms of infrastructure requirements, and it can be used in combination
with other approaches providing soft privacy. However, like any data-perturbative approach, our privacy-enhancing technology
comes at the expense of a loss in data utility, in particular a degradation of the quality of the recommender's predictions. Put
another way, it poses a trade-off between privacy and utility.

The objective of this paper is to investigate mathematically said trade-off. For this purpose, first we propose a quantitative 
measure of both privacy and utility. We quantify privacy risk as the KL divergence between the user's rating distribution and the 
population's, and measure utility as the fraction of ratings the user is willing to forge and suppress. With these two quantities, 
we formulate a multiobjective optimization problem characterizing the trade-off between privacy risk on the one hand, and on 
the other forgery rate and suppression rate. 

Our theoretical analysis provides a closed-form solution to this problem and characterizes the optimal trade-off surface 
between privacy and utility. The solution is confined to the closure of the noncritical-privacy region. The interior of the 
critical-privacy region is of no interest as the privacy risk attains its minimum value at the boundary of this region. In the region of
interest, our analysis finds that the optimal forgery and suppression strategies are orthogonal. In addition, these two strategies 
follow an intuitive principle. The forgery strategy recommends adding ratings to those categories where the user's interest is 
lower than the population's. The suppression strategy suggests eliminating those ratings belonging to the categories where the 
user shows too much interest compared to the reference distribution. 

Our theoretical study also examines how these optimal strategies perturb user profiles. It is interesting to observe that the
optimal apparent profile becomes proportional to the population's distribution in those categories with the lowest and highest
ratios $\frac{q_k}{p_k}$. Our analysis also includes the characterization of $\mathcal{R}$ at low rates of forgery and suppression. More accurately, we
provide a first-order Taylor approximation of the privacy-utility trade-off function, from which we conclude that the ratios $\frac{q_1}{p_1}$
and $\frac{q_n}{p_n}$ determine, together with the quantity $D(q\,\|\,p)$, the privacy risk at low rates. An eye-opening fact is that the relative
decrement in privacy risk is greater than the forgery rate introduced.

Further, we consider the special case when forgery and suppression are not used in combination. Under this consideration,
we investigate which one is the most appropriate technique, first, in terms of causing the minimum distortion to reach the
critical-privacy region, and secondly, in terms of offering better privacy protection at low rates. Our findings show that the
arithmetic and geometric means of the maximum and minimum ratios $\frac{q_k}{p_k}$ play a fundamental role in deciding the best technique
to use. Afterwards, our formulation and theoretical analysis are illustrated with a numerical example.

Fig. 10: We assume that the 4 099 users satisfying the positivity assumption (2) protect their privacy by using a common forgery rate and a common suppression
rate. Under this assumption, we plot some percentile surfaces of relative reduction in privacy risk, against these two common rates.

Fig. 11: Probability distribution of the relative decrement factors of forgery and suppression.

Fig. 12: Probability distribution of the critical forgery and suppression rates.

Finally, the last section is devoted to the experimental evaluation of our data-perturbative mechanism in a real-world
recommendation system. In particular, we examine how the application of the forgery and the suppression of ratings may
preserve user privacy in Movielens. Among other results, we find that a large majority of users significantly reduce privacy risk
for forgery and suppression rates of just 15%. In our data set, the probability distributions of the relative decrement factors
indicate that, at low rates, forgery provides a higher reduction in privacy risk than suppression does. By contrast, we observe
that the suppression relative decrement factor is greater than that of forgery in 43.45% of users. Lastly, we consider the case
when users must opt for either forgery or suppression, and find that the latter is the best strategy to use in 95.3% of users
who wish to eliminate privacy risk while causing the minimum distortion.